# 📚 Data Collection

Available: Saturday, 15 March @ 0800

A key focus of this lesson is **data in Python**, which will set the stage for later technical components of this course. Python is a powerful tool for data analysis, and throughout this course, we will leverage libraries such as `pandas`, `numpy`, and `matplotlib` to work with different types of data. This lesson introduces core data concepts while also familiarizing you with Python-specific data types and structures.  

📌 **For a beginner-friendly Intro to Python for Data Analysis, check out:**  
[Python for Data Analysis YouTube Playlist by Data Daft](https://www.youtube.com/playlist?list=PLiC1doDIe9rCYWmH9wIEYEXXaJ4KAi3jc)  
This playlist offers digestible explanations of Python’s core data analysis features and is a great supplementary resource.  We'll be incorporating some of their lessons into this course.



**Why Python for Data Analysis?** 

Python is widely used in data science and in the analysis of misinformation, disinformation, and malinformation (MDM) because it provides:  

- **Flexibility**: Python can handle structured and unstructured data, making it ideal for analyzing social media posts, news articles, and survey data.  
- **Powerful Libraries**: Packages like `pandas` for data manipulation, `numpy` for numerical operations, and `matplotlib` for visualization simplify analysis.  
- **Automation**: Python allows us to automate data collection, preprocessing, and visualization, making large-scale MDM analysis feasible.  
- **Integration**: Python seamlessly integrates with APIs, databases, and web scraping tools, making it easier to acquire real-world MDM-related data. 


## Understanding the Problem: A Critical First Step

Before diving into data collection and analysis, it is essential to fully understand the misinformation problem we are investigating. A well-defined problem ensures that our research remains focused, actionable, and impactful. 

### Why Is Understanding the Problem Important?

Misidentifying the problem can lead to wasted resources, misleading conclusions, and ineffective interventions. In misinformation research, failing to define the problem properly can result in:

- **Data Overload:** Collecting excessive, unfocused data that is difficult to analyze.
- **Misinterpretation of Findings:** Drawing incorrect conclusions by analyzing data that does not align with the core research question.
- **Ineffective Solutions:** Proposing interventions that do not address the root causes of misinformation.

To mitigate these risks, researchers must take a step back and clarify what they are trying to study before collecting data.

### Key Considerations in Defining the Problem

When investigating misinformation, consider these critical questions:

- **What specific misinformation topic or narrative are we studying?** (e.g., COVID-19 vaccine misinformation, election fraud claims)
- **What are the characteristics of the misinformation we are examining?** (e.g., misleading headlines, fabricated statistics, deepfakes)
- **Who is spreading it?** (e.g., individuals, bot networks, foreign influence campaigns)
- **How is it spreading?** (e.g., social media, traditional news, word-of-mouth)
- **What is its impact?** (e.g., behavioral changes, public health risks, political polarization)

By answering these questions, we refine our understanding and establish a clear research direction.

## Formulating a Research Question

Once the problem is well understood, the next step is to develop a clear and specific research question. A well-formulated research question ensures that data collection and analysis remain focused and actionable.

### Characteristics of a Strong Research Question

A good research question should be:

- **Specific:** It clearly defines the scope and focus of the research.
- **Measurable:** It can be answered using quantifiable data.
- **Actionable:** Its findings provide insights that can inform decision-making.

### Example Research Questions

| Broad Question | Refined, Data-Driven Question |
|---------------|--------------------------------|
| How does misinformation about COVID-19 vaccines spread? | How does the engagement (likes, shares) of COVID-19 misinformation tweets differ across geographic regions? |
| Who spreads false vaccine claims? | What is the relationship between a user's follower count and the likelihood of their tweet being shared? |
| How effective is fact-checking? | Do tweets containing COVID-19 fact-checking information receive more or less engagement than misinformation tweets? |
| Which groups are most affected? | Which geographic locations show the highest engagement with COVID-19 misinformation tweets? |

By refining broad questions into specific, measurable, and actionable research questions, we create a strong foundation for data collection and analysis.


## 🔍 Our Focused Research Question Moving Forward

We've explored the problem and refined our research questions. For the upcoming lessons, we will focus on one specific research question:

**"How does the engagement (likes, shares) of COVID-19 misinformation tweets differ across geographic regions?"**

📌 **Why This Question?**

- Location-based analysis allows us to compare misinformation engagement trends across different regions.
- Engagement metrics (likes, shares) help us quantify the spread and influence of misinformation.
- The data is structured and measurable, which makes it well suited to visualization and trend analysis.

🔹 As we move forward with this lesson, we will walk through data identification, acquisition and analysis steps using this research question as our guiding example. 🚀 
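
To make the research question concrete, here is a minimal sketch of the comparison it implies: group tweets by region and summarize their engagement. The column names (`region`, `likes`, `shares`) and the sample rows are invented for illustration and are not taken from the course dataset.

```python
import pandas as pd

# Hypothetical sample rows; the values are made up for illustration only.
tweets = pd.DataFrame(
    {
        "region": ["North America", "Europe", "Europe", "Asia", "North America"],
        "likes": [120, 40, 60, 15, 80],
        "shares": [30, 10, 20, 5, 10],
    }
)

# Define engagement per tweet as likes + shares (one possible definition).
tweets["engagement"] = tweets["likes"] + tweets["shares"]

# Average engagement per tweet within each region, highest first.
engagement_by_region = (
    tweets.groupby("region")["engagement"].mean().sort_values(ascending=False)
)
print(engagement_by_region)
```

Using the mean rather than the sum keeps regions with many tweets from dominating the comparison; other summaries (median, totals) may suit a different framing of the question.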

In [3]:
# Question 1: Best Research Question Selection
create_multiple_choice(
    "Which of the following is the BEST example of a structured, data-driven research question?",
    [
        "A) Why do people believe COVID-19 vaccine misinformation?",
        "B) How does misinformation affect people's health choices?",
        "C) What is the engagement difference between COVID-19 vaccine misinformation tweets and their fact-checked corrections?",
        "D) Why are people hesitant to get vaccinated?"
    ],
    "C) What is the engagement difference between COVID-19 vaccine misinformation tweets and their fact-checked corrections?"
)

# Question 2: Identifying a Weak Research Question
create_multiple_choice(
    "Which of these is a weak research question for misinformation analysis?",
    [
        "A) Which Twitter accounts were most active in spreading COVID-19 vaccine misinformation between March 2020 and March 2022?",
        "B) What percentage of misinformation engagement comes from bot accounts vs. real users?",
        "C) Why do people share false information?",
        "D) How does engagement with misinformation compare to engagement with fact-checks?"
    ],
    "C) Why do people share false information?"
)

# Question 3: Selecting the Right Research Question for Bot Detection
create_multiple_choice(
    "If you want to study how bots amplify COVID-19 vaccine misinformation, which research question is most appropriate?",
    [
        "A) What are the most common emotional triggers used in vaccine misinformation?",
        "B) What percentage of accounts spreading vaccine misinformation on Twitter are bots?",
        "C) What are the political beliefs of people who share misinformation?",
        "D) How does vaccine misinformation compare between Twitter and Facebook?"
    ],
    "B) What percentage of accounts spreading vaccine misinformation on Twitter are bots?"
)

# Question 4: Matching Data Sources to Research Questions
create_multiple_choice(
    "Which dataset would BEST help answer the question: 'Which narratives gained the most traction in COVID-19 vaccine misinformation?'",
    [
        "A) Kaggle COVID-19 Fake News Dataset",
        "B) Twitter API for tweet engagement metrics",
        "C) Google Fact Check API",
        "D) Botometer API"
    ],
    "B) Twitter API for tweet engagement metrics"
)

# Question 5: Refining Research Questions
create_multiple_choice(
    "Which of the following broad research questions has been correctly refined into a data-driven research question?",
    [
        "A) Why do people believe misinformation? → How does misinformation affect people's emotions?",
        "B) How does misinformation spread online? → Which Twitter accounts were most active in sharing COVID-19 vaccine misinformation between Mar and Apr?",
        "C) Who spreads misinformation? → What makes someone share misinformation?",
        "D) How effective is fact-checking? → Why do people ignore fact-checks?"
    ],
    "B) How does misinformation spread online? → Which Twitter accounts were most active in sharing COVID-19 vaccine misinformation between Mar and Apr?"
)


<img src="Images\data identification.png">

## 🔹 Data Identification: Finding the Right Data for Analysis

Once we have defined the problem and formulated a research question, the next step is to determine what data is needed to analyze it effectively. Not all data is equally useful, and selecting the right datasets ensures the validity and reliability of our findings.

### Identifying Key Data Sources

To analyze misinformation, we need to collect relevant and structured data from sources where misinformation is actively spread and discussed. Common sources include:

🟢 **Social Media Platforms** (e.g., Twitter, Facebook, TikTok, Reddit) 
  - Useful for tracking how misinformation spreads and its engagement levels.
  - Data points: post content, user engagement (likes, shares, comments), timestamps, geolocation.

🟢 **News & Fact-Checking Databases** (e.g., PolitiFact, Snopes, FactCheck.org)
  - Helps validate misinformation claims and compare false information with factual corrections.
  - Data points: claim description, verification status, publication date, source.

🟢 **Surveys & Polls**
  - Measures public perception and belief in misinformation.
  - Data points: demographics, misinformation exposure, behavioral impact.

🟢 **Metadata & Network Analysis**
  - Identifies bot networks and coordinated campaigns.
  - Data points: user connections, retweet patterns, account creation dates.

### Examples of Suitable Data for Various Topics

| Research Question | Example Good Data Sources | Key Data Points |
|------------------|------------------------|----------------|
| How does misinformation engagement vary by platform? | Twitter API, Facebook Graph API, TikTok Data | Platform type, engagement metrics (likes, shares), misinformation content |
| How do coordinated disinformation campaigns operate? | Open-source intelligence (OSINT), Botometer, CrowdTangle | Network connections, bot detection markers, retweet/share patterns |
| What misinformation narratives about elections are most prevalent? | Election Integrity Partnership, MediaCloud, Twitter API | Misinformation topic, source credibility, user engagement |
| What factors contribute to vaccine hesitancy? | WHO Vaccine Misinformation Reports, Pew Research Surveys, Facebook Data | Misinformation topic, sentiment analysis, user demographics |
| How effective is fact-checking? | FactCheck.org, Snopes, PolitiFact | Claim description, verification status, engagement metrics |

By aligning our data sources with our research question, we ensure that our analysis is evidence-based and well-supported.


Based on our research question, **"How does the engagement (likes, shares) of COVID-19 misinformation tweets differ across geographic regions?"**, we need a dataset that contains:

✅ Misinformation and fact-checked content on COVID-19

✅ Engagement metrics (likes, shares, retweets, comments, etc.) on different social media platforms

✅ Temporal data (timestamps for tracking trends over time and identifying peak misinformation periods)

✅ Geolocation data (country, region, or city-level information to analyze geographic differences)

✅ User metadata (follower count, account type, potential bot indicators) to assess influence on engagement
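
A checklist like this can be turned into a quick automated check before any analysis begins. The sketch below reads only the header of a CSV file and reports which required columns are absent. The column names are hypothetical stand-ins for the items above, not the actual schema of the course dataset.

```python
import csv
import io

# Hypothetical column names standing in for the checklist items above.
REQUIRED_COLUMNS = {
    "text", "likes", "retweets", "created_at", "user_location", "user_followers",
}

def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = set(reader.fieldnames or [])
    return sorted(required - header)

# In-memory sample so the example runs without a file on disk.
sample = io.StringIO("text,likes,retweets,created_at\nExample tweet,3,1,2021-03-01\n")
print(missing_columns(sample))
```

With a real file, the same function could be called as `missing_columns(open("tweets.csv", newline="", encoding="utf-8"))`, where `tweets.csv` is a placeholder name.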

🚀 **Next Up:** How do we collect the right misinformation data? Let’s dive into **Data Acquisition – Where and How Do We Collect Data?**

<img src="Images\data acquisition.png">

## 🔹 Data Acquisition – Where and How Do We Collect Misinformation Data?

### 📌 Why This Step Matters  

Now that we’ve **defined what we want to analyze** and **identified the type of data we need**, we need to determine **where to find the data** to answer our research question.

Effective **data acquisition** ensures we gather **reliable, relevant, and actionable** information while avoiding incomplete or biased datasets.

In this section, we’ll explore:

✅ **Primary vs. Secondary Data Sources** – What’s the difference, and when should we use them?  

✅ **Common Misinformation Data Sources** – Where do we get structured and unstructured data?  

✅ **Challenges in Data Collection** – What are the ethical, technical, and accessibility concerns?  

---

### 🔍 **1: Primary vs. Secondary Data Sources**  

Not all data is collected the same way. We can obtain misinformation-related data from **two main sources**:

| **Data Type**      | **Definition** | **Example for MDM Analysis** | **Pros** | **Cons** |
|-------------------|--------------|--------------------------|----------|----------|
| **Primary Data**  | Data you collect directly through APIs, surveys, experiments, or web scraping. | Using **Twitter API** to extract misinformation tweets. | ✅ Customizable, ✅ More control over accuracy. | ❌ Requires technical setup, ❌ May have platform restrictions. |
| **Secondary Data** | Data collected by external organizations or researchers and made publicly available. | Using **fact-checking databases** like Snopes or Google Fact Check Explorer. | ✅ Easy access, ✅ Less resource-intensive. | ❌ May be outdated, ❌ May not fit your exact research question. |

---

### 🛠️ **2: Common MDM Data Sources**  

Now, let’s explore some of the best sources for **collecting misinformation-related data**.

#### **1️⃣ Social Media Data (Primary Data)**
🟢 **Best for:** Analyzing misinformation spread, engagement, bot activity, or amplification tactics.

| **Platform**          | **How to Collect Data** | **Example Use Case** |
|----------------------|----------------------|---------------------|
| **Twitter (now X)** | API access via the developer portal (paid tiers; free access is very limited) | Track how a specific false narrative spreads. |
| **Facebook & Instagram** | Meta Content Library (application required; it replaced CrowdTangle, which Meta shut down in August 2024) | Measure engagement on misinformation posts. |
| **Reddit**          | Reddit API; Pushshift for historical data (access restricted since 2023) | Analyze misinformation discussions in niche communities. |
| **TikTok & YouTube** | YouTube Data API (video metadata and comments); TikTok Research API (approved researchers only); manual collection otherwise | Identify influencers amplifying false narratives. |

⚠ **Considerations:**  
🔹 Some platforms require **API access approval** (may take time).  
🔹 **Ethical concerns** – Ensure user privacy and compliance with platform policies.

---

#### **2️⃣ Fact-Checking Databases (Secondary Data)**
🟢 **Best for:** Comparing misinformation claims to verified information.

| **Database**            | **How to Access** | **Example Use Case** |
|------------------------|------------------|---------------------|
| **Snopes**            | Website search   | Compare viral misinformation claims with debunks. |
| **PolitiFact**        | API or website search | Track political misinformation trends. |
| **Google Fact Check Explorer** | API or web search | Aggregate fact-checks across multiple sources. |
| **Poynter IFCN**      | Database search  | Access global fact-checking organizations' reports. |

📌 **Pro Tip:** Fact-checking databases are **great for validation** but don’t always provide engagement metrics (likes, shares, comments).

---

#### **3️⃣ Web Scraping & News Archives (Primary & Secondary)**
🟢 **Best for:** Collecting misinformation from news sites, forums, or blogs.

| **Source Type**     | **How to Collect Data** | **Example Use Case** |
|--------------------|----------------------|---------------------|
| **News Sites**     | Web scraping with BeautifulSoup or Selenium | Track how misinformation headlines change over time. |
| **Conspiracy Blogs** | Scraping tools or manual collection | Analyze how narratives evolve in alternative media. |
| **Wikipedia Edits** | Wikipedia API | Detect misinformation in article edits. |

⚠ **Considerations:**  
🔹 **Terms of service** – Scraping in violation of a site's terms of service can lead to blocked access.  
🔹 **Legality** – Some sites prohibit automated scraping; check each site's policies before collecting data.
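
One practical way to check a site's stated crawling rules is to read its `robots.txt` file. The sketch below uses Python's standard `urllib.robotparser` on a made-up `robots.txt` held in memory, so it runs without network access. The user agent name `research-bot` and the `example.com` URLs are placeholders. Passing this check does not replace reading the site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, parsed from memory so the example needs no network access.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "research-bot" and the example.com URLs are placeholders.
allowed_public = parser.can_fetch("research-bot", "https://example.com/news/article-1")
allowed_private = parser.can_fetch("research-bot", "https://example.com/private/data")
print(allowed_public, allowed_private)
```

Against a live site, the same parser can load the real file with `parser.set_url(...)` followed by `parser.read()` before calling `can_fetch`.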

---

#### **4️⃣ Surveys & Experimental Data (Primary Data)**
🟢 **Best for:** Measuring public perception, belief trends, and misinformation susceptibility.

| **Method**          | **How to Collect Data** | **Example Use Case** |
|-------------------|----------------------|---------------------|
| **Online Surveys** | Google Forms, Qualtrics, or MTurk | Measure belief in misinformation narratives. |
| **Controlled Experiments** | Academic studies or user testing | Test misinformation susceptibility before and after fact-checking. |

📌 **Pro Tip:** If conducting a survey, ensure **neutral question framing** to avoid biasing responses.

---


### 📝 **3: Common File Types for Storage** 

#### Why File Types Matter
When working with data, it is essential to understand different file types, as they determine how data is stored, shared, and processed. Many datasets used in misinformation analysis come in a variety of formats, each with its own advantages and use cases.

#### Common Data File Types

 **1. CSV (.csv)** – Comma-Separated Values
- A widely used format for structured data.
- Stores tabular data in plain text with commas separating values.
- Easy to process with Python using `pandas.read_csv()`.

 **2. Excel (.xlsx, .xls)** – Microsoft Excel Spreadsheets
- Commonly used for structured data with multiple sheets.
- Supports formulas, charts, and formatting.
- Readable in Python using `pandas.read_excel()`.

 **3. JSON (.json)** – JavaScript Object Notation
- Used for structured data, especially from APIs and web sources.
- Stores data as key-value pairs, making it flexible.
- Readable in Python using `json.load()` or `pandas.read_json()`.

 **4. SQL Databases** – Structured Query Language
- Stores large-scale, relational data efficiently.
- Often used for handling misinformation datasets from multiple sources.
- Python’s `sqlite3` or `SQLAlchemy` can query SQL databases.

 **5. Parquet (.parquet)** – Optimized for Big Data
- Columnar storage format, making it faster for big data processing.
- Used in large-scale analytics and machine learning pipelines.
- Readable in Python using `pandas.read_parquet()`.

 **6. Text Files (.txt)** – Unstructured Text Data
- Common for logs, reports, or raw text analysis.
- Useful for analyzing misinformation in articles, comments, or tweets.
- Readable using standard Python file operations (`open()`, `read()`).

 **Additional Considerations**
- **APIs & Web Scraped Data:** Often comes in **JSON**, **HTML**, or **CSV** formats.
- **PDFs:** Frequently used for government reports and academic papers, but require tools like `pypdf` (the successor to `PyPDF2`) or `pdfplumber` to extract data.
- **HTML:** Webpage data may need `BeautifulSoup` for parsing.

 **Key Takeaway**
Most file types can be read and processed using Python with the right tools. Understanding these formats is essential for effective data collection and analysis in misinformation research.
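
As a small illustration of how the same records look across formats, the sketch below reads two made-up tweets from CSV text, from JSON text, and from an in-memory SQLite table using only the standard library. The field names `tweet_id` and `likes` are placeholders. The `pandas` readers mentioned above are higher-level alternatives that return DataFrames instead.

```python
import csv
import io
import json
import sqlite3

# Made-up records; the field names are placeholders, not the course dataset schema.
csv_text = "tweet_id,likes\n1,10\n2,25\n"
json_text = '[{"tweet_id": 1, "likes": 10}, {"tweet_id": 2, "likes": 25}]'

# CSV: every value is read as a string, so numeric fields need explicit conversion.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
csv_likes = [int(row["likes"]) for row in csv_rows]

# JSON: numbers keep their type because the format distinguishes them from strings.
json_rows = json.loads(json_text)
json_likes = [row["likes"] for row in json_rows]

# SQL: load the same records into an in-memory SQLite database and query them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (tweet_id INTEGER, likes INTEGER)")
conn.executemany("INSERT INTO tweets VALUES (:tweet_id, :likes)", json_rows)
total_likes = conn.execute("SELECT SUM(likes) FROM tweets").fetchone()[0]
conn.close()

print(csv_likes, json_likes, total_likes)
```

The type difference between CSV and JSON is a common source of bugs: a `likes` column read from CSV sorts as text ("10" before "9") until it is converted.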


### ⚠️ Challenges in Data Collection

When working with misinformation datasets, researchers often face three key challenges:

#### 1️⃣ Accessibility & API Restrictions

🔹 Some platforms (e.g., Facebook, Instagram) limit public API access, making direct misinformation collection difficult.

🔹 Data licensing may prevent certain datasets from being freely used.

🔹 Some sources require institutional approval (e.g., the Meta Content Library; Twitter’s Academic Research API track worked this way until it was discontinued in 2023).

🔹 Data availability varies by region—some datasets may be inaccessible in certain countries due to government regulations.

🔹 API limitations (e.g., rate limits, paywalls) can restrict the volume of data that can be collected in a given timeframe.

##### 🔹 Solution:

✅ Use alternative data sources (e.g., Reddit API is more open than Facebook).

✅ Partner with research institutions for data-sharing agreements.

✅ Apply for researcher access programs where platforms offer them (e.g., the Meta Content Library or the TikTok Research API).

✅ Consider ethical web scraping techniques while ensuring compliance with platform terms of service.

✅ Use open-source datasets compiled by misinformation research initiatives.
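
Rate limits in particular are usually handled in code by waiting and retrying. Below is a minimal sketch of exponential backoff. `RateLimitError` and `fake_fetch` are hypothetical stand-ins for whatever a real API client raises and returns, and the delays are illustrative rather than tuned to any platform.

```python
import time

class RateLimitError(Exception):
    """Hypothetical error a client might raise when the API asks us to slow down."""

def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on RateLimitError wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # Give up after the final attempt.
            sleep(base_delay * 2 ** attempt)

# Demo: a fake fetch that is rate-limited twice, then succeeds.
calls = {"count": 0}

def fake_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimitError
    return {"tweets": []}

waits = []  # Record the requested delays instead of actually sleeping.
result = fetch_with_backoff(fake_fetch, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the demo instant; in real use the default `time.sleep` applies the delays. Many APIs also return a header stating when the limit resets, which is preferable to guessing when available.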

#### 2️⃣ Ethics & Privacy Concerns

🔹 User privacy – Collecting personally identifiable information (PII) without consent is unethical and, in some cases, illegal (e.g., GDPR, CCPA).

🔹 Risk of amplification – Sharing misinformation data without context may unintentionally spread false narratives further.

🔹 Ethical concerns in web scraping – Automated collection of social media data may violate terms of service and risk exposing user data.

🔹 Misinformation impact – Storing and analyzing sensitive topics (e.g., public health misinformation) requires responsible handling to avoid contributing to harm.

##### 🔹 Solution:

✅ Anonymize data – Remove personal identifiers before analysis.

✅ Focus on aggregate insights, not individual user data.

✅ Clearly document data collection methods to ensure ethical transparency.

✅ Obtain proper consent if using survey-based misinformation research.

✅ Apply for Institutional Review Board (IRB) approval when working with human-related misinformation data.

✅ Store datasets securely and limit access to authorized researchers only.
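
As one illustration of removing personal identifiers, the sketch below replaces usernames with keyed hashes so that posts by the same account can still be grouped without storing the handle. The key, field names, and records are all made up. Strictly speaking this is pseudonymization rather than full anonymization: anyone holding the key can recompute the mapping, and regulations such as GDPR may still treat such data as personal.

```python
import hashlib
import hmac

# Placeholder secret; in practice it would be generated randomly and stored
# separately from the dataset.
SECRET_KEY = b"replace-with-a-random-secret"

def pseudonymize(user_id: str, key: bytes = SECRET_KEY) -> str:
    """Map a user identifier to a stable pseudonym using keyed SHA-256 (HMAC)."""
    digest = hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # Shortened for readability.

# Made-up records for illustration.
records = [
    {"username": "@example_user", "likes": 12},
    {"username": "@another_user", "likes": 3},
    {"username": "@example_user", "likes": 7},
]

# Keep only the pseudonym and the metric needed for aggregate analysis.
anonymized = [
    {"user": pseudonymize(r["username"]), "likes": r["likes"]} for r in records
]
print(anonymized)
```

A keyed hash is used instead of a plain hash because usernames are easy to guess: without a secret key, anyone could hash a list of known handles and reverse the mapping.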

#### 3️⃣ Data Noise & Bias

🔹 Not all data is relevant or high-quality – Social media posts may contain spam, satire, or unrelated content.

🔹 Algorithmic bias – Platform engagement algorithms may skew which misinformation spreads the most.

🔹 Selection bias – Certain datasets may overrepresent specific demographics, political affiliations, or geographic regions.

🔹 Missing context – Misinformation posts may not always be labeled as such, making classification difficult.

🔹 Fact-checking lag – Real-time misinformation research may struggle with delayed fact-checking verification.

##### 🔹 Solution:

✅ Preprocess & clean data (we’ll cover this in Lesson 5).

✅ Cross-check multiple sources to reduce platform bias.

✅ Use diverse datasets from multiple platforms to improve representativeness.

✅ Apply machine learning techniques to filter out spam, satire, and irrelevant content.

✅ Validate misinformation classifications by cross-referencing fact-checking databases.

✅ Adjust for algorithmic bias by examining platform policies and content curation practices.

By addressing these challenges, researchers can ensure that misinformation data is collected and analyzed responsibly, leading to more accurate and impactful findings. 🚀

---

### 📋 **4: Assessing Datasets**  

After identifying potential data sources, it is crucial to assess their quality, reliability, and limitations before using them for analysis. Not all datasets are created equal, and selecting the wrong data can lead to bias, misinterpretation, or misleading conclusions.

#### 📊 Dataset Evaluation Rubric  

To ensure a dataset is **suitable for misinformation research**, use the rubric below to **assess its quality, structure, and limitations** before proceeding.  

| **Evaluation Criteria**     | **Considerations** |
|----------------------------|--------------------|
| **Relevance**              | Does the dataset align with my research question? |
| **Data Type**              | Is the data **structured** (tables, metrics, engagement stats) or **unstructured** (tweets, articles, images, videos)? |
| **Reliability**            | Is the source credible (**research institute, fact-checking organizations**)? |
| **Access & Limitations**   | Does the dataset require **API access, institutional approval, or have significant restrictions**? |
| **Bias & Representativeness** | Does the dataset reflect a **diverse sample** or is it skewed? |
| **Quality & Completeness** | Does the dataset contain **missing values, duplicate data, or errors**? |


---


#### **🔍 Course Dataset: Kaggle COVID-19 Tweet Dataset (Modified)**  

🔗 [Kaggle COVID-19 Tweets](https://www.kaggle.com/datasets/kaushiksuresh147/covidvaccine-tweets)  
File Type: CSV

---

#### 📊 **Dataset Evaluation: Kaggle COVID-19 Tweet Dataset (Modified)**  

| **Evaluation Criteria**     | **Considerations**                                                          | **Assessment** |
|----------------------------|-----------------------------------------------------------------------------|---------------|
| **Relevance**              | Does the dataset align with our research question?                          | ✅ High relevance—focuses on COVID-19 misinformation and fact-checking engagement. |
| **Data Type**              | Is the data structured (tables, metrics) or unstructured (tweets, images)?  | 🟡 Semi-structured—includes both tweets and metadata (engagement, timestamps). |
| **Reliability**            | Is the source credible (research institutions, fact-checkers)?              | ✅ Based on a widely used Kaggle dataset with verified misinformation labels. |
| **Access & Limitations**   | Does the dataset have API restrictions, missing data, or biases?           | ❌ Some missing timestamps, multilingual data challenges. |
| **Bias & Representativeness** | Does the dataset reflect diverse misinformation patterns?                 | ⚠️ Has gaps and distortions. |
| **Quality & Completeness** | Does the dataset contain errors, duplicates, or missing values?            | ✅ No duplicates, few missing values. |
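
The Quality & Completeness row of a rubric like this can be checked directly rather than by eye. The sketch below counts missing values and duplicate rows with `pandas` on a small made-up table; the column names and rows are illustrative and are not drawn from the course dataset.

```python
import pandas as pd

# Made-up rows containing one exact duplicate and one missing timestamp.
df = pd.DataFrame(
    {
        "text": ["claim A", "claim B", "claim B", "claim C"],
        "created_at": ["2021-03-01", "2021-03-02", "2021-03-02", None],
        "likes": [5, 12, 12, 0],
    }
)

# Count of missing values in each column.
missing_per_column = df.isna().sum()

# Number of rows that are identical to an earlier row.
duplicate_rows = int(df.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

Note that `duplicated()` only flags rows that match exactly; retweets or near-identical posts with different timestamps would need a check on a subset of columns.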

💡 **The original Kaggle dataset is structured and mostly preprocessed, making it ideal for learning, but real-world misinformation data is rarely this clean. We've introduced modifications, creating a dataset that better reflects actual challenges in data analysis.**  

The **modified version** includes:  
🔹 **Multilingual tweets** to test language detection and NLP processing.  
🔹 **Altered timestamps** to simulate irregularities in dataset collection.  
🔹 **Randomized engagement metrics** to introduce noise and require filtering.  
🔹 **Additional misinformation labels** with potential misclassifications for accuracy testing.  

This dataset will be used in **upcoming lessons** on **exploratory data analysis, data preprocessing, and visualization**.  

In [4]:
create_multiple_choice(
    "Given the research question: 'How does misinformation about election fraud spread on Twitter?', which dataset would be the best choice?",
    [
        "A) A survey on voter beliefs regarding election fraud",
        "B) Twitter API data containing election-related tweets and engagement metrics",
        "C) Fact-checking reports from PolitiFact and Snopes",
        "D) A dataset of government election results"
    ],
    "B) Twitter API data containing election-related tweets and engagement metrics"
)
create_multiple_choice(
    "Which of the following would likely present a major challenge when trying to collect misinformation-related data from Facebook?",
    [
        "A) API access is restricted, making it difficult to extract large-scale misinformation data",
        "B) Facebook provides real-time, open access to all misinformation-related posts",
        "C) Facebook data is easy to analyze because all posts are fact-checked",
        "D) Facebook requires no ethical considerations when collecting misinformation data"
    ],
    "A) API access is restricted, making it difficult to extract large-scale misinformation data"
)
create_multiple_choice(
    "You want to analyze bot accounts spreading misinformation about COVID-19 vaccines. Which dataset would be the most useful?",
    [
        "A) A dataset containing Twitter user metadata, including account age and posting frequency",
        "B) A news article summarizing bot activity during the pandemic",
        "C) A dataset of verified government health reports",
        "D) A dataset of randomized social media posts unrelated to COVID-19"
    ],
    "A) A dataset containing Twitter user metadata, including account age and posting frequency"
)
create_multiple_choice(
    "For the research question: 'What percentage of misinformation engagement on Twitter comes from bot accounts?', which data type would be most critical?",
    [
        "A) Survey responses from Twitter users",
        "B) Metadata on Twitter accounts, including retweet patterns and bot scores",
        "C) A dataset of misinformation posts fact-checked by Snopes",
        "D) A historical database of political misinformation"
    ],
    "B) Metadata on Twitter accounts, including retweet patterns and bot scores"
)
create_multiple_choice(
    "Which of the following is a key ethical concern when collecting misinformation data?",
    [
        "A) Ensuring the dataset is structured correctly",
        "B) Avoiding personally identifiable information (PII) in collected data",
        "C) Making sure the dataset contains a mix of factual and false information",
        "D) Ensuring the dataset is large enough for meaningful analysis"
    ],
    "B) Avoiding personally identifiable information (PII) in collected data"
)


create_fill_in_the_blank(
    "When collecting misinformation data directly from social media APIs, this is an example of a ____________ (primary or secondary) data source.",
    "primary"
)



## Understanding the Problem, Data Identification, and Data Acquisition for Your Chosen Narrative

### 📌 Applying What You’ve Learned

So far, we’ve walked through a structured approach to understanding a problem, identifying data, and acquiring data. Now it’s time to apply these same steps to your own MDM research narrative.

In Assignment 2, you will:

✅ Formulate 3-5 structured research questions based on your chosen misinformation narrative.

✅ Identify relevant data sources that best align with your research questions.

✅ Critically assess dataset quality, structure, and limitations.

By completing this assignment, you will be prepared to collect, clean, and analyze misinformation data in upcoming lessons.

---

## 🚀 **Next Lesson: Data Inspection, Preprocessing, and Exploratory Data Analysis (EDA)**

In the next lesson, we will:

✔ Inspect the dataset structure to understand column types, categorical vs numerical data, and overall format.

✔ Clean and preprocess our dataset to remove inconsistencies, missing values, and irrelevant data.

✔ Explore data distributions by analyzing engagement patterns, misinformation prevalence, and geographic trends.

✔ Visualize key trends using charts, histograms, and heatmaps to identify patterns in misinformation spread.

✔ Detect outliers and anomalies in engagement metrics to uncover potential bot activity or coordinated disinformation campaigns.