# Week 5: Data Discovery and Collection

## I. Introduction to Data in Decision-Making

### A. Data Wrangling Recap
Data wrangling involves a series of tasks to transform raw data into a usable format for analysis. Key stages include:
-   Data Discovery
-   Data Collection
-   Data Cleaning
-   Data Storing
-   Data Pre-processing
-   Data Transformation
-   Data Enrichment
-   Data Validation

### B. Importance of Data
Data is foundational for both modern business and scientific inquiry.

**In Business Decision-Making**:
-   **Evidence-based Decision-making**: Enables objective analysis and insight generation.
-   **Customer Insights**: Helps in understanding customer behavior and offering personalized services.
-   **Performance Optimization**: Improves operational efficiency and allows for benchmarking.
-   **Risk Management**: Aids in identifying risks and forming mitigation strategies.
-   **Innovation**: Identifies market needs and creates feedback loops for product development.

**In Scientific Inquiry**:
-   Enables quantitative and qualitative analysis.
-   Facilitates hypothesis testing and theory development.
-   Drives innovation and discovery.
-   Improves research design.

---

## II. Data Discovery

### A. Definition
Data discovery is the process of **identifying and understanding data sources** that can be used for analytical purposes. Its primary goal is to learn what data is available and how well it can support research or business objectives.

### B. Challenges in Data Discovery
-   **Volume and Complexity**: The sheer amount of data can be overwhelming.
-   **Data Quality and Silos**: Data can be inconsistent, inaccurate, or isolated in different departments.
-   **Dynamic Data**: Data is constantly evolving and changing.
-   **Privacy and Security**: Concerns about protecting sensitive information.
-   **Lack of Metadata**: Poor documentation makes data hard to understand and use.
-   **Integration Issues**: Difficulties in combining data from different systems.

### C. The Data Discovery Process
This is a series of tasks to identify, understand, and prepare data for analysis.

1.  **Identification of Data Sources**:
    -   Inventory existing data, also known as **secondary data**, which has already been collected for other purposes.
    -   Sources can be **internal** (e.g., business records, transactional data) or **external** (e.g., government publications, social media, industry reports).
    -   **Advantages of Existing Data**: Cost and time efficiency, access to broad data, and suitability for trend analysis.
    -   **Limitations of Existing Data**: May not be relevant, quality can be poor, and access might be difficult.

2.  **Data Profiling and Assessment**:
    -   **Understand Data Structure**:
        -   **Structured**: Highly organized, easily understood by machines (e.g., relational databases).
        -   **Unstructured**: No pre-defined model, harder to analyze (e.g., text, multimedia).
        -   **Time-Series**: Data points indexed in time order (e.g., sensor data, financial data).
        -   **Graph Data**: Represents relationships between entities (e.g., social networks).
        -   **Big Data**: Extremely voluminous data that requires advanced tools to process.
    -   **Content Exploration**: Delve into the data to understand the types of information it holds (categorical, numerical, etc.).
    -   **Quality Assessment**: Evaluate data for issues like missing values, duplicates, and inconsistencies.

3.  **Data Cataloging and Metadata Management**:
    -   Gather metadata (data about data) to understand its origin, format, and characteristics.
    -   Create a searchable catalog of data assets.
    -   Document data lineage to trace its path from source to current state.

4.  **Data Cleaning and Preparation**:
    -   **Cleansing**: Address quality issues by correcting errors, filling missing values, or removing duplicates.
    -   **Transformation**: Convert data into a suitable format for analysis (e.g., normalization, aggregation).

5.  **Data Integration and Consolidation**:
    -   Combine data from multiple sources to create a unified dataset.
    -   Harmonize formats and units to ensure consistency.

6.  **Security and Compliance**:
    -   Implement measures to protect sensitive data and ensure compliance with regulations like GDPR.
    -   Establish access controls for authorized users only.

7.  **Data Exploration and Visualization**:
    -   Conduct Exploratory Data Analysis (EDA) to find initial patterns, trends, and anomalies.
    -   Use visualization to represent data graphically for easier interpretation.

8.  **Documentation and Sharing**:
    -   Document all findings, challenges, and insights.
    -   Share findings with stakeholders to support decision-making.
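
The profiling and quality-assessment step (step 2) can be sketched with the Python standard library. This is a minimal illustration, not a full profiling tool: the sample CSV, the column names, and the set of tokens treated as "missing" are all assumptions made for the example.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for an existing (secondary) dataset.
SAMPLE_CSV = """customer_id,age,city
1,34,Leeds
2,,York
3,29,leeds
2,,York
4,n/a,Hull
"""

MISSING_TOKENS = {"", "n/a", "na", "null"}  # assumed markers for missing values


def profile_rows(rows):
    """Return a per-column summary plus the number of exact duplicate rows."""
    columns = {}
    for name in rows[0].keys():
        values = [row[name].strip() for row in rows]
        present = [v for v in values if v.lower() not in MISSING_TOKENS]
        # Rough numeric check: digits with at most one decimal point.
        numeric = [v for v in present if v.replace(".", "", 1).isdigit()]
        all_numeric = bool(present) and len(numeric) == len(present)
        columns[name] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "kind": "numerical" if all_numeric else "categorical",
        }
    # Rows that repeat an earlier row exactly count as duplicates.
    row_counts = Counter(tuple(row.values()) for row in rows)
    duplicates = sum(count - 1 for count in row_counts.values())
    return columns, duplicates


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
columns, duplicates = profile_rows(rows)
for name, summary in columns.items():
    print(name, summary)
print("duplicate rows:", duplicates)
```

In this sample, `Leeds` and `leeds` are counted as two distinct cities, which is the kind of inconsistency that the cleansing step would then need to resolve.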

---

## III. Data Collection

### A. Definition
Data collection is the **systematic process of gathering and measuring information** on variables of interest to answer research questions, test hypotheses, and evaluate outcomes.

### B. Key Data Distinctions

**1. Primary vs. Secondary Data**
-   **Primary Data**: Original data collected firsthand by the researcher for a specific purpose. The researcher controls the methodology and scope.
-   **Secondary Data**: Information that was previously collected by someone else for a different purpose.

**2. Quantitative vs. Qualitative Data**
-   **Quantitative Data**: Data that can be measured numerically. It focuses on *quantity* and is suitable for statistical analysis. Often collected via surveys and experiments.
-   **Qualitative Data**: Descriptive and conceptual data that can be observed but not measured with numbers. It is often textual or visual and helps in understanding concepts, thoughts, or experiences.

### C. Data Collection Methods

**1. For Structured Data**
-   **Surveys and Questionnaires**: Require clear objectives, good question design, proper sampling, and pilot testing.
-   **Web Scraping**: Involves legal and ethical considerations, technical challenges, and data quality checks.
-   **Relational Databases**: Require good database design, data integrity, and planning for scalability and security.
-   **APIs (Application Programming Interfaces)**: Involve reviewing documentation, handling authentication, and respecting rate limits.
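
As one illustration of the API considerations above (authentication and rate limits), the sketch below collects paginated results with a pause between requests. Everything specific here is an assumption: `fetch_page`, the bearer-token header, and the `records` / `next_page` response shape are hypothetical and would need to match the documentation of whichever API is actually used.

```python
import time


def collect_pages(fetch_page, api_key, max_pages=10, min_interval=1.0, sleep=time.sleep):
    """Collect records page by page, pausing between requests.

    fetch_page(page, headers) is a hypothetical callable that returns a dict
    like {"records": [...], "next_page": int or None}.
    """
    headers = {"Authorization": f"Bearer {api_key}"}  # assumed auth scheme
    records = []
    page = 1
    for _ in range(max_pages):
        response = fetch_page(page, headers)
        records.extend(response["records"])
        page = response.get("next_page")
        if page is None:
            break
        sleep(min_interval)  # wait before the next request to respect the rate limit
    return records


# Stand-in for a real HTTP call, so the sketch runs without network access.
def fake_fetch(page, headers):
    assert headers["Authorization"].startswith("Bearer ")
    data = {1: ["a", "b"], 2: ["c"]}
    return {"records": data[page], "next_page": page + 1 if page < 2 else None}


collected = collect_pages(fake_fetch, api_key="demo-key", min_interval=0.0)
print(collected)
```

Passing the fetch function in as a parameter keeps the pagination and rate-limit logic separate from the HTTP client, so the same loop can be tested offline and reused with a real client later.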

**2. For Unstructured Data**
-   **Text Mining & NLP**: Requires significant data preparation, choosing the right models, and understanding context.
-   **Image and Video Collection**: Key considerations include ensuring consistent quality, diversity, accurate annotations, and managing large file sizes.
-   **Social Media and Web Content**: Must adhere to API terms of use, handle "noisy" and dynamic content, and consider sampling bias.
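
The data preparation mentioned for text mining can be sketched as follows. This is a deliberately small example: the stopword list and the two review strings are made up for illustration, and real projects would typically rely on a dedicated NLP library and a much fuller stopword list.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "is", "and", "was", "but", "it", "to"}


def tokenize(text):
    """Lowercase the text and keep word tokens that are not stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]


# Hypothetical customer reviews standing in for collected text.
reviews = [
    "The delivery was late, but the support team was helpful.",
    "Helpful support and fast delivery!",
]
counts = Counter(token for review in reviews for token in tokenize(review))
print(counts.most_common(3))
```

Word counts like these ignore context (for example, negation), which is why the choice of model and an understanding of context matter in later steps.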

**3. For Semi-structured Data**
-   **JSON and XML Extraction**: Requires understanding the hierarchical structure and using appropriate parsing libraries.
-   **Logs and Sensor Data**: Must handle high volume and velocity, diverse formats, and time-sensitivity.
-   **Email and Communication Data**: Presents challenges related to privacy compliance, complex structures, and filtering noise.
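
To illustrate the point about hierarchical structure in JSON, the sketch below parses a nested record with Python's built-in `json` module and flattens it into dotted column names. The sample record and the dotted-path naming scheme are assumptions chosen for the example; other flattening conventions are equally valid.

```python
import json

# Hypothetical nested record, as it might arrive from an API or export file.
RAW = '{"id": 7, "user": {"name": "Ana", "tags": ["new", "vip"]}, "total": 19.5}'


def flatten(node, prefix=""):
    """Turn nested dicts and lists into a flat dict keyed by dotted paths."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)  # list positions become path segments
    else:
        return {prefix: node}  # leaf value: store it under the path built so far
    flat = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


record = flatten(json.loads(RAW))
print(record)
```

A flat record like this can be written to a table row, which is one common way to move semi-structured data toward a structured form.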

---

## IV. Ethical Considerations and Privacy

Ethical principles are paramount in data collection to protect individuals' rights and maintain trust.

-   **Informed Consent**: Participants should be fully informed about the data collection in a transparent way and participate voluntarily.
-   **Privacy and Anonymity**: Personal information must be protected, often through anonymization techniques.
-   **Data Security**: Implement secure storage, transmission, and access controls.
-   **Compliance**: Adhere to all relevant laws and regulations (e.g., GDPR).
-   **Data Minimization**: Collect only the data that is necessary for the specific purpose.
-   **Equity and Fairness**: Ensure data collection is inclusive and avoids bias.
-   **Data Retention and Disposal**: Have clear policies for how long data is stored and how it is securely destroyed.
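
Two of the principles above, data minimization and protecting identity, can be sketched in code. The example keeps only the fields a study needs and replaces the email address with a keyed hash. The field names, the secret key, and the sample record are hypothetical. Note that this is pseudonymization rather than full anonymization: anyone holding the key can recompute the mapping, so the output may still count as personal data under regulations such as GDPR.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would be stored separately from the dataset.
SECRET_KEY = b"replace-with-a-secret-stored-outside-the-dataset"
NEEDED_FIELDS = ("age_band", "response")  # assumed fields the study actually needs


def pseudonymize(identifier):
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize(record):
    """Keep only the needed fields and swap the email for a pseudonym."""
    reduced = {field: record[field] for field in NEEDED_FIELDS}
    reduced["participant"] = pseudonymize(record["email"])
    return reduced


# Hypothetical raw survey record.
raw = {"email": "ana@example.com", "phone": "555-0100", "age_band": "25-34", "response": "agree"}
print(minimize(raw))
```

The same email always maps to the same pseudonym, so repeated responses from one participant can still be linked without storing the address itself.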

# Data Discovery and Collection
Data preprocessing pipeline, in order:  
- Data discovery  
- Collection  
- Storing  
  
- Cleaning  
- Preprocessing  
- Validation  
  
- Transformation  
- Enrichment  

In reality, we usually collect the data we need for the question or problem we have.  
In the assignment, it's reversed: we find the questions or problems our data can answer.  
Insights should be deep: look for something that is not immediately obvious.  
Don't include every plot, only the most important ones.  

Question 1:
Is big data structured data, unstructured data or semi-structured data?  
All three: big data can include structured, unstructured, and semi-structured data.  

Question 2:
Can you give 4 examples of primary data and secondary data respectively?
Primary data:
1. Surveys
2. Interviews
3. Sensor data
4. Lab experiments

Secondary data:
1. Web articles
2. Scientific papers
3. Online databases
4. Customer reviews

Question 3:
When you join a project in your final year, do you need to submit an ethics application? Why?
As long as your project involves any kind of data collection or experiment, you will need an ethics application.
This is to ensure that your data collection or experiment respects the participants and their data, collecting as little data as possible and keeping it as anonymous as possible.