# Data Inspection and Preprocessing

<img src="Images\now what.png">

Before We Get Into Data Analysis… 📝
Let's do a quick Python primer to make sure we have the foundational knowledge we need.

---

What is Python? 🐍

Python is a **high-level, interpreted programming language** known for its simplicity and readability. It is widely used in fields like web development, data science, artificial intelligence, and automation.

**Why Use Python?**
- **Easy to Learn** – Uses simple and intuitive syntax.
- **Versatile** – Supports web development, data analysis, machine learning, and more.
- **Large Community** – Extensive libraries and active developer support.
- **Cross-Platform** – Runs on Windows, macOS, Linux, and more.

Python often lets programmers accomplish a task in fewer lines of code than languages such as C or Java. For example, a complete "Hello, World!" program in Python is a single line:

In [1]:
print("Hello, World!")

Hello, World!


Data Types in Python 📊

Data types categorize values, allowing Python to interpret and process them correctly. Understanding data types is essential for effective data handling, as each type supports specific operations and behaviors.

---

**1. Strings (`str`) 📝**
- Strings are sequences of characters, used to represent text.  
- Enclosed in single or double quotes.

    **Examples:**

        "Alice"  
        "Hello, world!"

**2. Numbers (`int`, `float`)** 🔢
- Integers (`int`): Whole numbers, positive or negative.
- Floats (`float`): Decimal numbers.

    **Examples:**

        10
        13.5

**3. Booleans (`bool`)** ✅
- `True` or `False` values used in logic.


**4. Lists (`list`)** 📋
- Ordered, mutable (changeable) collections.

    **Examples:**
        fruits = ["apple", "banana", "cherry"]
        colors = ["red", "blue", "yellow"]


**5. Tuples (`tuple`)** 🔄
- Ordered, immutable (cannot be changed) collections.
    
    **Examples:**
        days = ("Monday", "Tuesday", "Wednesday")
        coordinates = (40.7128, -74.0060)
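
The difference between mutable lists and immutable tuples is easiest to see by trying to change each one. The sketch below reuses the `fruits` and `days` examples from above; the replacement values are illustrative only.

```python
# Lists are mutable: items can be replaced or appended in place.
fruits = ["apple", "banana", "cherry"]
fruits[1] = "orange"
fruits.append("mango")
print(fruits)  # ['apple', 'orange', 'cherry', 'mango']

# Tuples are immutable: item assignment raises a TypeError.
days = ("Monday", "Tuesday", "Wednesday")
try:
    days[0] = "Sunday"
    tuple_changed = True
except TypeError as error:
    tuple_changed = False
    print("Cannot modify a tuple:", error)
```

Because a tuple cannot change after creation, it is a reasonable choice for fixed records such as coordinates.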

**6. Dictionaries (`dict`)** 🔑
- Key-value pairs, useful for structured data.

    **Examples:**
        person = {
                "name": "Alice",
                "age": 30,
                "city": "New York"
        }
        
        student = {
                "name": "John",
                "grades": {"math": 90, "science": 85, "history": 88},
                "graduated": False
        }
        
        product = {
                "id": 101,
                "name": "Laptop",
                "price": 799.99,
                "in_stock": True
        }
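
As a brief sketch of how such dictionaries are used, the example below reads, updates, and summarizes the `student` record from above. The `"email"` key and the `"art"` grade are hypothetical additions for illustration.

```python
student = {
    "name": "John",
    "grades": {"math": 90, "science": 85, "history": 88},
    "graduated": False,
}

# Look up a value by its key.
print(student["name"])  # John

# Nested dictionaries are accessed one key at a time.
print(student["grades"]["math"])  # 90

# .get() returns a default instead of raising KeyError for a missing key.
print(student.get("email", "not provided"))  # not provided

# Dictionaries are mutable: add or update entries by assignment.
student["graduated"] = True
student["grades"]["art"] = 92

average = sum(student["grades"].values()) / len(student["grades"])
print(average)  # 88.75
```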

Summary 📌

| **Data Type**  | **Description**                  | **Example**            |
|---------------|----------------------------------|------------------------|
| `str` (String) | Text data                        | `"Hello"`              |
| `int` (Integer) | Whole numbers                   | `42`                   |
| `float` (Float) | Decimal numbers                 | `3.14`                 |
| `bool` (Boolean) | True/False values               | `True`                 |
| `list` (List)  | Ordered, mutable collection      | `["apple", "banana"]`  |
| `tuple` (Tuple) | Ordered, immutable collection   | `("red", "blue")`      |
| `dict` (Dictionary) | Key-value pairs             | `{"name": "Alice"}`    |
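
To check these types yourself, Python's built-in `type()` function reports the type of any value. A minimal sketch using the example values from the table:

```python
# One example value for each data type in the summary table.
examples = [
    "Hello",
    42,
    3.14,
    True,
    ["apple", "banana"],
    ("red", "blue"),
    {"name": "Alice"},
]

# type(value).__name__ gives the short type name as a string.
type_names = [type(value).__name__ for value in examples]

for value, type_name in zip(examples, type_names):
    print(f"{value!r} is of type {type_name}")
```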


In [2]:
# Question 1: Identifying a String
create_multiple_choice(
    "What is the data type of the following value: 'Hello, World!'?",
    choices=["int", "float", "str", "bool"],
    correct_answer="str"
)

# Question 2: Identifying a Boolean
create_multiple_choice(
    "What is the data type of the following value: True?",
    choices=["str", "int", "bool", "tuple"],
    correct_answer="bool"
)

# Question 3: Identifying an Integer
create_multiple_choice(
    "What is the data type of the following value: 42?",
    choices=["int", "float", "str", "bool"],
    correct_answer="int"
)

# Question 4: Identifying a Float
create_multiple_choice(
    "What is the data type of the following value: 3.14?",
    choices=["int", "float", "str", "list"],
    correct_answer="float"
)

# Question 5: Identifying a Dictionary
create_multiple_choice(
    "What is the data type of the following value: {'name': 'Alice', 'age': 25}?",
    choices=["list", "tuple", "dict", "set"],
    correct_answer="dict"
)

Variables 🏷️
Variables store and manage data in Python. Assigning a value to a name makes it easy to reference and manipulate that value throughout a program. Variables can hold any type of data, including numbers, text, lists, and more complex objects.

In Python, you create a variable by assigning a value with the `=` operator.

In [3]:
# Assigning values to variables

name = "Alice"     # String
age = 30           # Integer
height = 5.7       # Float
is_student = True  # Boolean

print(name)

Alice

Rules for Naming Variables

✔ Must start with a letter or _ (underscore).  
✔ Can contain letters, numbers, and underscores.  
✔ Cannot be a reserved keyword (e.g., def, class, if).  
✔ Use descriptive names (user_name instead of x).
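
The first three rules can be checked programmatically. The sketch below combines the built-in `str.isidentifier()` method with the standard-library `keyword` module; the helper name `is_valid_variable_name` and the candidate names are hypothetical and used only for illustration.

```python
import keyword


def is_valid_variable_name(name):
    """Return True if `name` is a legal identifier and not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ["user_name", "_count", "2fast", "class", "total-price"]:
    print(f"{candidate!r}: {is_valid_variable_name(candidate)}")
```

Here `2fast` fails because it starts with a digit, `class` because it is a reserved keyword, and `total-price` because a hyphen is not allowed in a name.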

Python is dynamically typed: a variable's type is determined by the value currently assigned to it, so no explicit type declaration is needed. This keeps code short, though it also means the same name can refer to values of different types at different points in a program.
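
A small sketch of dynamic typing in action: the same name is rebound to values of three different types, and `type()` reports whatever it currently holds. The variable name `value` is arbitrary.

```python
value = 10
first_type = type(value).__name__   # the name currently refers to an int

value = "ten"
second_type = type(value).__name__  # now it refers to a str

value = [10]
third_type = type(value).__name__   # and now to a list

print(first_type, second_type, third_type)  # int str list
```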

You can change the value of a variable at any time:

In [4]:
name = "Andrea"  # Changed from Alice to Andrea
print(name)

Andrea

Variables in Action

Variables can be used in operations and functions:

In [5]:
# Numeric operations
age = 30
new_age = age + 5
print(new_age)  # Output: 35

# String concatenation
greeting = "Hello, " + name + "!"
print(greeting)  # Output: Hello, Andrea!

# Boolean logic
is_adult = age > 18
print(is_adult)  # Output: True

35
Hello, Andrea!
True

In [6]:
# Question 1: Understanding Initial Variable Assignment
create_multiple_choice(
    "What is the value of `x` after the following code runs?\n\n"
    "x = 10\n"
    "x = 20\n",
    choices=["10", "20", "30", "None"],
    correct_answer="20"
)

# Question 2: Changing Data Types Through Reassignment
create_multiple_choice(
    "What is the data type of `y` after the following code executes?\n\n"
    "y = 5.5\n"
    "y = 10\n",
    choices=["int", "float", "str", "bool"],
    correct_answer="int"
)

# Question 3: String Concatenation in Variable Reassignment
create_multiple_choice(
    "What is the final value of `name`?\n\n"
    "name = \"Alice\"\n"
    "name = name + \" Smith\"\n",
    choices=["Alice", "Smith", "Alice Smith", "Error"],
    correct_answer="Alice Smith"
)

# Question 4: Boolean Reassignment and Logic
create_multiple_choice(
    "What is the value of `is_active` after the following code executes?\n\n"
    "is_active = True\n"
    "is_active = not is_active\n",
    choices=["True", "False", "None", "Error"],
    correct_answer="False"
)

# Question 5: List Modification and Indexing
create_multiple_choice(
    "What is the value of `fruits[1]` after the following code runs?\n\n"
    "fruits = [\"apple\", \"banana\", \"cherry\"]\n"
    "fruits[1] = \"orange\"\n",
    choices=["banana", "orange", "cherry", "apple"],
    correct_answer="orange"
)

Functions in Python 🔄

A **function** is a reusable block of code that performs a specific task. Functions allow you to organize, simplify, and reuse code efficiently.

**Why Use Functions? 🤔**
✔ **Avoid repetition** – Write code once, reuse it anywhere.  
✔ **Improve readability** – Break large programs into smaller, manageable parts.  
✔ **Make debugging easier** – Isolate and fix issues in a specific function.  

---

**Defining a Function 🏗️**
You define a function using the `def` keyword, followed by the function name and parentheses `()`.


In [7]:
def greet():
    print("Hello, world!")  # Function body (code inside the function)

To call (execute) the function, write its name followed by parentheses:

In [8]:
greet()

Hello, world!

Functions with Parameters 🎯

Functions can take parameters, allowing them to work dynamically with different values.

In [9]:
def add_numbers(a, b):
    return a + b  # Returns the sum of a and b

result = add_numbers(5, 10)
print(result)  # Output: 15

15

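Multiple Parameters, Default Values & Keyword Arguments

Functions can accept multiple parameters. A parameter can be given a default value, which is used when the caller omits that argument, and arguments can also be passed by name (keyword arguments).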

In [10]:
def describe_person(name, age, city="Unknown"):
    print(f"{name} is {age} years old and lives in {city}.")

describe_person("Alice", 30, "New York")
describe_person("Bob", 25)  # Uses the default city

Alice is 30 years old and lives in New York.
Bob is 25 years old and lives in Unknown.
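
The cell above shows a default value but not keyword arguments themselves. As a hedged sketch, the variation below returns the sentence instead of printing it (a change made here so the result can be reused), calls the function with arguments passed by name, and adds a hypothetical `describe_anything` helper to show how `**kwargs` collects arbitrary keyword arguments.

```python
def describe_person(name, age, city="Unknown"):
    return f"{name} is {age} years old and lives in {city}."


# Keyword arguments: values are matched by parameter name, so order is free.
sentence = describe_person(age=25, name="Bob")
print(sentence)  # Bob is 25 years old and lives in Unknown.


def describe_anything(**kwargs):
    """Collect any keyword arguments into a dictionary named kwargs."""
    return ", ".join(f"{key}={value}" for key, value in kwargs.items())


summary = describe_anything(name="Alice", city="New York")
print(summary)  # name=Alice, city=New York
```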

Summary 📌

✔ Functions make code reusable and modular.  
✔ Use return to get results from functions.  
✔ Default parameters provide flexibility.  
✔ Lambda functions simplify one-line operations.  
✔ Functions can process lists, loops, and work with global/local variables.
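
The last two points of the summary (lambda functions, and functions working with lists, loops, and global or local variables) are not illustrated in the cells above. A minimal sketch, with all names chosen here for illustration:

```python
words = ["banana", "fig", "cherry"]

# Lambda: a small anonymous function, here used as a sort key.
by_length = sorted(words, key=lambda word: len(word))
print(by_length)  # ['fig', 'banana', 'cherry']


def total_length(items):
    """Loop over a list and add up the length of each item."""
    total = 0  # local variable: exists only inside this function
    for item in items:
        total += len(item)
    return total


print(total_length(words))  # 15

counter = 0  # global variable: defined at module level


def increment():
    global counter  # needed to rebind the module-level name
    counter += 1


increment()
increment()
print(counter)  # 2
```

Relying on `global` is usually best kept to a minimum; passing values in and returning results tends to be easier to debug.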

Now that we understand functions, let's explore Python libraries and how they can make coding even more powerful! 🚀

Libraries in Python 📚
Libraries in Python are collections of pre-written code that extend the language’s functionality, allowing users to perform complex tasks without having to write code from scratch. These libraries can handle a wide range of functions, including numerical computations, data manipulation, statistical modeling, and data visualization. They streamline workflows, reduce development time, and enable efficient data analysis by providing optimized, reusable functions and tools tailored for specific tasks. 

Packages turn what may be **hundreds to millions** of lines of code into single calls that a user can make. Take, for example, generating a random number with Python's built-in `random` module:

    import random
    num = random.randint(1, 100)

This call generates a random integer between 1 and 100 (inclusive), but behind the scenes the full code involves complex algorithms for randomness, seeding, and number generation.
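
Seeding is worth a quick illustration, since it is what makes pseudo-random results reproducible. In the sketch below, resetting the seed to the same value replays the same sequence; the seed value 42 is arbitrary.

```python
import random

# Seeding fixes the starting state of the generator.
random.seed(42)
first_run = [random.randint(1, 100) for _ in range(5)]

# Re-seeding with the same value replays the same sequence.
random.seed(42)
second_run = [random.randint(1, 100) for _ in range(5)]

print(first_run == second_run)  # True
```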

What the Full Code Looks Like Behind the Scenes

If we were to implement a basic random number generator from scratch without using random.randint(), we would have to manually implement a pseudo-random number generator (PRNG) like the Linear Congruential Generator (LCG):


In [11]:
class SimpleRandom:
    def __init__(self, seed=42):
        self.modulus = 2**32
        self.multiplier = 1664525
        self.increment = 1013904223
        self.seed = seed

    def next_random(self):
        self.seed = (self.multiplier * self.seed + self.increment) % self.modulus
        return self.seed / self.modulus  # Normalize to the range [0, 1)

    def randint(self, low, high):
        return low + int(self.next_random() * (high - low + 1))

# Usage:
rand_gen = SimpleRandom()
print(rand_gen.randint(1, 100))  # Output: a pseudo-random integer between 1 and 100 (the same on every run for a fixed seed)

26

This is just a simplified version! Python's actual `random` module is more sophisticated: it uses the Mersenne Twister algorithm, implemented in optimized C, and hides all of that behind a single `random.randint()` call. Note that `random` is not designed for cryptographic security; for security-sensitive randomness, Python provides the separate `secrets` module.

One of the greatest aspects of Python is that anyone, including you, can create and publish a new library for others to use, contributing to the vast ecosystem of tools that makes Python so powerful. Libraries help with everything from mathematical operations to data visualization, and include options for games, multimedia editing, and networking.

Common Libraries for Data Analysis:
- **NumPy** – Efficient numerical operations.
- **pandas** – Data manipulation and analysis.
- **matplotlib & seaborn** – Data visualization.
- **statsmodels** – Statistical modeling.

Sources for Different Libraries 🌐
Python libraries can be sourced from:
- **PyPI (Python Package Index)** – Official repository for Python packages.
- **Anaconda Distribution** – A Python distribution that bundles the conda package manager with many pre-installed data science libraries.
- **GitHub** – Community-driven open-source projects.

You can search for packages at [pypi.org](https://pypi.org) or explore data science tools on [Anaconda](https://www.anaconda.com/).

---

Key Takeaway 🎯

Packages abstract away the complexity of programming, making powerful functionality accessible to everyone with just a few lines of code. Without packages, programmers would have to write, test, and debug that functionality themselves every time they needed it.

<img src="Images\data preprocessing.png">

**Lesson 5: Data Preprocessing**

**Introduction**  

📌 Why is Data Preprocessing Critical in Misinformation Analysis?  

🔹 Data preprocessing is a crucial first step in misinformation analysis, ensuring that raw, unstructured, and often messy data is cleaned and formatted before deeper exploration.  

🔹 Misinformation datasets come from diverse sources—social media, news articles, fact-checking organizations, and academic research—and often contain inconsistencies, missing values, duplicate records, and irrelevant content.  

🔹 Without proper preprocessing, any conclusions drawn from the data could be flawed or misleading.  

🛠️ **This lesson will equip you with fundamental data preprocessing skills, applicable across various MDM (mis-, dis-, and malinformation) narratives.**  

---

**🚀 Structure of Data Preprocessing**
Data preprocessing can be broken into **four key steps**, each crucial for preparing data for analysis:

1️⃣ **Cleaning** – Handling missing values, removing duplicates, standardizing formats.  
2️⃣ **Transforming** – Normalizing numerical data, encoding categorical variables, and restructuring datasets.  
3️⃣ **Enriching** – Adding metadata, integrating external datasets, and creating derived features.  
4️⃣ **Reducing** – Filtering out irrelevant content, reducing dimensionality, and simplifying datasets for analysis.  

Each step ensures **higher data quality, consistency, and relevance** when analyzing patterns.  
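
As a rough sketch of how the four steps fit together, the example below runs a tiny, made-up table through each one using pandas. The column names (`text`, `platform`, `shares`, `internal_id`), the values, and the "viral" threshold of 100 shares are all assumptions for illustration, not part of any real dataset.

```python
import pandas as pd

# Hypothetical toy data; every column name and value is illustrative.
posts = pd.DataFrame({
    "text": ["Claim A", "Claim A", "Claim B", "Claim C"],
    "platform": ["Twitter", "Twitter", "facebook ", "Twitter"],
    "shares": [10.0, 10.0, None, 250.0],
    "internal_id": [1, 1, 2, 3],
})

# 1. Cleaning: standardize text, fill missing counts, drop duplicate rows.
cleaned = posts.copy()
cleaned["platform"] = cleaned["platform"].str.strip().str.lower()
cleaned["shares"] = cleaned["shares"].fillna(0)
cleaned = cleaned.drop_duplicates().reset_index(drop=True)

# 2. Transforming: min-max normalize a numeric column, one-hot encode a category.
share_min, share_max = cleaned["shares"].min(), cleaned["shares"].max()
cleaned["shares_scaled"] = (cleaned["shares"] - share_min) / (share_max - share_min)
encoded = pd.get_dummies(cleaned, columns=["platform"])

# 3. Enriching: create a derived feature (threshold is an assumption).
encoded["is_viral"] = encoded["shares"] > 100

# 4. Reducing: drop a column that is irrelevant to the analysis.
reduced = encoded.drop(columns=["internal_id"])

print(reduced)
```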

---


**1️⃣ Cleaning: Handling Missing Values, Removing Duplicates, and Standardizing Formats**  

📌 Why is Cleaning Important?  

Misinformation datasets, especially those sourced from **social media, fact-checking websites, and news sources**, often contain **incomplete, duplicated, or inconsistent data**.  
Cleaning ensures that the data is **accurate, reliable, and ready for further analysis**.  

Key cleaning tasks include:  
✅ **Handling Missing Values** – Filling in or removing incomplete data points.  
✅ **Removing Duplicates** – Eliminating redundant misinformation claims.  
✅ **Standardizing Formats** – Ensuring consistent date formats, text casing, and categorical labels.  
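
Removing duplicates and standardizing formats usually go together, because two records often only look different due to casing or stray whitespace. The sketch below uses a small invented table; the column names `claim`, `date`, and `label` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical records: the first two are the same claim with different formatting.
claims = pd.DataFrame({
    "claim": ["The moon is made of cheese", "the moon is made of cheese ", "Cats can fly"],
    "date": ["2021-03-01", "2021-03-01", "2021-04-15"],
    "label": ["FALSE", "false", "False"],
})

# Standardize text casing and strip surrounding whitespace.
claims["claim"] = claims["claim"].str.strip().str.lower()
claims["label"] = claims["label"].str.strip().str.lower()

# Standardize dates by converting strings to a proper datetime type.
claims["date"] = pd.to_datetime(claims["date"])

# Remove duplicates once the formats agree.
deduplicated = claims.drop_duplicates(subset=["claim", "date"]).reset_index(drop=True)

print(deduplicated)
```

Standardizing first matters here: without it, `drop_duplicates` would treat the first two rows as different claims.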

---

**🛠️ Step 1: Handling Missing Values**  
**Why Does Missing Data Occur?**  

Missing data is a common issue in misinformation datasets due to:  
- **Incomplete user submissions** (e.g., missing engagement metrics on social media).  
- **Data collection limitations** (e.g., missing timestamps from scraped web content).  
- **Gaps in fact-checking labels** (e.g., unverified misinformation claims).  

**Approaches to Handling Missing Data**  
There are multiple strategies to address missing data, depending on the context.

| **Approach**          | **When to Use It**  | **Example in Misinformation Analysis** |
|-----------------------|--------------------|-------------------------------------|
| **Replace with Zero (0)** | When missing data logically represents "none" or "zero quantity." | If a tweet has missing "likes" or "shares," assume **0** instead of leaving it blank. |
| **Replace with NaN (Null)** | When missing data indicates missing information rather than a meaningful zero. | If a misinformation claim is missing a **fact-checking label**, leave it as NaN to indicate uncertainty. |
| **Statistical Imputation** | When data can be inferred based on available information. | If engagement metrics are missing, replace with the **median engagement** of similar misinformation posts. |
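
The three approaches in the table can be sketched side by side on a small invented table. The `topic` column is an assumption used here to define "similar" posts for median imputation; the other column names and all values are likewise hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical posts with gaps in engagement metrics and labels.
posts = pd.DataFrame({
    "topic": ["health", "health", "health", "politics", "politics"],
    "likes": [10.0, np.nan, 30.0, 5.0, np.nan],
    "shares": [np.nan, 2.0, 4.0, np.nan, 1.0],
    "label": ["false", None, "misleading", "false", None],
})

# Replace with zero: a missing share count is treated as "no shares".
posts["shares"] = posts["shares"].fillna(0)

# Statistical imputation: fill missing likes with the median of the same topic.
topic_median = posts.groupby("topic")["likes"].transform("median")
posts["likes"] = posts["likes"].fillna(topic_median)

# Keep as NaN: a missing fact-checking label stays missing to signal uncertainty.
missing_labels = posts["label"].isna().sum()

print(posts)
print("Labels still missing:", missing_labels)
```

Which approach is appropriate depends on what a missing value means in context; filling with zero where the true value is simply unknown can bias later analysis.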

**💻 Python Example: Detecting and Handling Missing Values**
```python
import pandas as pd
import numpy as np

# Load the dataset (Example: COVID-19 misinformation dataset)
df = pd.read_csv("misinformation_dataset.csv")

# Check for missing values
print("Missing values before cleaning:\n", df.isnull().sum())

# Strategy 1: Fill missing engagement metrics with 0
df['likes'] = df['likes'].fillna(0)
df['shares'] = df['shares'].fillna(0)

# Strategy 2: Fill missing timestamps with a placeholder
df['timestamp'] = df['timestamp'].fillna("Unknown")

# Strategy 3: Replace missing text-based misinformation labels with 'Unverified'
df['misinformation_label'] = df['misinformation_label'].fillna("Unverified")

# Display cleaned data
print("Missing values after cleaning:\n", df.isnull().sum())
df.head()
