# 📚 EDA and Patterns

**🔍 Why This Lesson Matters**  
Before making decisions based on data, we must first **understand what the data is telling us**. Raw data is often **messy, incomplete, and full of patterns we might not immediately recognize**.  

**Exploratory Data Analysis (EDA)** helps us:  
✅ Detect **errors, inconsistencies, and missing values**.  
✅ Understand **distributions, relationships, and trends** in the data.  
✅ Identify **hidden patterns** that might influence our analysis.  

By exploring data before running any models, we **avoid false assumptions, misleading conclusions, and poor decision-making**.  

---

![eda2](../_static/eda2.png)
[Source & Optional Reading: EDA, GeeksforGeeks](https://www.geeksforgeeks.org/what-is-exploratory-data-analysis/)

---

**📌 What You’ll Learn in This Lesson**  
By the end of this lesson, you will be able to:  

✅ Define **Exploratory Data Analysis (EDA)** and explain why it is important.  
✅ Recognize **patterns in data** (trends, correlations, categorical distributions).  
✅ Identify **missing values, outliers, and inconsistencies**.  
✅ Understand how **time-based and categorical trends** impact analysis.  

💡 **This lesson focuses on concepts.** You are not expected to write any code yet. Instead, we will build a strong understanding of **how to think about data before analyzing it**.  

---

**📌 Why Exploratory Data Analysis (EDA)?**  
EDA is the **first step in any data analysis process**. It helps answer questions like:  

- **What does the data look like?** (Basic structure, data types, missing values).  
- **Are there errors or inconsistencies?** (Duplicates, incorrect formatting, null values).  
- **How is the data distributed?** (Skewness, normality, extreme values).  
- **Are there relationships between variables?** (Correlations, categorical trends, time-based patterns).  

✅ **Key Benefits of EDA**  

| **Benefit** | **Why It Matters** |
|------------|------------------|
| **Prevents misleading conclusions** | Helps us catch flawed or biased data before we base an analysis on it. |
| **Helps identify important features** | Guides feature selection for better predictive modeling. |
| **Improves decision-making** | Ensures that data-driven decisions are based on reality, not assumptions. |
| **Enhances communication** | Clear visualizations and insights make data easier to explain to others. |

---


**📌 Understanding Your Data and Recognizing Patterns**  
Throughout this lesson, we will look for four kinds of patterns:  
- **Statistical patterns**: Understanding distributions and central tendencies.  
- **Visual patterns**: Spotting trends through histograms, scatterplots, and boxplots.  
- **Time-based patterns**: Detecting cycles and seasonality in data.  
- **Categorical patterns**: Exploring differences across groups. 
    
---

## 🔹 EDA as a Tool for Data Inspection

Before diving into **data cleaning and issue detection**, we first need to **inspect the dataset** to understand its **structure, data types, and general properties**. This **initial assessment** is a crucial part of EDA because it helps us determine:  

✅ **What kind of data we are working with** (numeric, categorical, text, datetime).  
✅ **How the dataset is structured** (rows, columns, missing values).  
✅ **Whether any transformations or standardizations are needed** before analysis.  

By understanding these aspects **before detecting issues**, we ensure that our cleaning and analysis efforts are well-informed.

---

### **1️⃣ Understanding Dataset Structure**  

One of the first steps in EDA is **inspecting the dataset’s dimensions and general layout**.  

**📌 Key Questions to Ask**
- 🔹 How many rows and columns does the dataset have?  
- 🔹 What types of variables are present (numerical, categorical, text-based)?  
- 🔹 Are there any immediately visible inconsistencies or patterns?  

💡 **Example:**  
- A dataset with **millions of rows** might require optimization techniques before analysis.  
- A dataset with **only a few columns** may need additional feature engineering to be useful.  
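
💻 **Optional preview in code:** You do not need to run anything yet, but the questions above can be answered in a few lines. This is a minimal sketch using only Python's standard library (later lessons use Pandas for the same task). The rows and column names are made up for illustration.

```python
# Hypothetical sample: each dict is one row (one social media post).
rows = [
    {"user": "a1", "language": "en", "likes": 12, "is_verified": False},
    {"user": "b2", "language": "es", "likes": 340, "is_verified": True},
    {"user": "c3", "language": "en", "likes": 7, "is_verified": False},
]

n_rows = len(rows)
columns = list(rows[0].keys())  # assumes every row has the same keys
n_cols = len(columns)

print(f"{n_rows} rows x {n_cols} columns")
print("columns:", columns)
```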

---

### **2️⃣ Checking Data Types**  

Every dataset consists of **different types of data**, and recognizing them early helps us determine **which operations are possible**.  

| **Data Type** | **Description** | **Example Columns** |
|--------------|----------------|------------------|
| **Numerical** | Continuous or discrete numbers. | `likes`, `shares`, `followers` |
| **Categorical** | Distinct groups or labels. | `language`, `location`, `user_type` |
| **Text (String)** | Free-form text. | `tweet_content`, `user_bio` |
| **Datetime** | Timestamps or date-related values. | `timestamp`, `post_date` |
| **Boolean** | True/False values. | `is_verified`, `has_link` |

**📌 Why Checking Data Types Matters**  
✅ **Ensures numerical columns can be used for calculations.**  
✅ **Helps identify if categorical values need to be standardized or encoded.**  
✅ **Ensures dates are stored in the correct format for time-based analysis.**  

💡 **Example:** If `timestamp` is stored as a string instead of a **datetime object**, it won’t be possible to analyze **time-based trends** without first converting it.
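
💻 **Optional preview in code:** A small sketch of what "checking data types" can look like, using only the standard library. The row below is hypothetical, and the timestamp layout (`YYYY-MM-DD HH:MM:SS`) is an assumption; real data may use a different format.

```python
from datetime import datetime

# One hypothetical row; note that the timestamp arrives as plain text.
row = {"likes": 120, "language": "en", "is_verified": True,
       "timestamp": "2023-02-01 14:30:00"}

# Report the Python type stored in each column of this row.
types = {col: type(val).__name__ for col, val in row.items()}
print(types)

# Convert the text into a datetime object before any time-based analysis.
parsed = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
print(parsed.year, parsed.month, parsed.hour)
```

Once the value is a `datetime` object, parts such as the year, hour, or day of the week can be read directly from it.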

---

### **3️⃣ Initial Data Summarization**  

After inspecting the structure and data types, the next step is to generate **summary statistics** to get a **high-level view of numerical and categorical variables**.  

**📌 Key Metrics to Examine** 

| **Metric** | **Why It’s Important** |
|-----------|---------------------|
| **Count** | Shows how many non-null values exist in each column. |
| **Mean, Median** | Describe the typical value of numerical data; a large gap between the two hints at skew. |
| **Standard Deviation** | Indicates how spread out the data is. |
| **Minimum & Maximum** | Helps detect extreme values and potential outliers. |
| **Most Frequent Categories** | Useful for understanding categorical distributions. |

💡 **Example:** If `followers` has a **mean of 30,000** but a **median of 1,500**, a few accounts with extremely high follower counts are pulling the mean upward, which points to a **right-skewed distribution**.
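
💻 **Optional preview in code:** The metrics in the table can be computed with Python's built-in `statistics` module. This is only a sketch: the `followers` and `languages` values are invented, and later lessons use Pandas to produce a similar summary.

```python
import statistics
from collections import Counter

followers = [120, 800, 1500, 1500, 2300, 950, 250_000]  # made-up values
languages = ["en", "en", "es", "en", "fr", "es", "en"]   # made-up values

summary = {
    "count": len(followers),
    "mean": statistics.mean(followers),
    "median": statistics.median(followers),
    "stdev": statistics.stdev(followers),
    "min": min(followers),
    "max": max(followers),
}
for name, value in summary.items():
    print(f"{name:>6}: {value:,.1f}")

# Most frequent category in a categorical column.
top_language, top_count = Counter(languages).most_common(1)[0]
print("most frequent language:", top_language, top_count)
```

In this made-up sample, a single very large account pulls the mean far above the median, which is the same signal described in the example above.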

---

### **4️⃣ Exploring Unique Values in Categorical Data**  

For categorical variables, **understanding the number of unique values** is key to detecting inconsistencies and ensuring proper encoding.  

**📌 Key Questions to Ask**  
🔹 How many unique categories exist in each column?  
🔹 Are there typos or different representations of the same category?  
🔹 Are there rare categories that could be grouped together?  

💡 **Example:** If `language` contains values like `"English"`, `"ENG"`, and `"en"`, these should be **standardized** to ensure consistency.
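
💻 **Optional preview in code:** A sketch of how inconsistent labels could be counted and then standardized. The `CANONICAL` mapping is a hypothetical lookup table written for this example; in practice it has to be built by inspecting the actual values.

```python
from collections import Counter

languages = ["English", "ENG", "en", "Spanish", "es", " english "]

# Hypothetical mapping from cleaned-up variants to one canonical label.
CANONICAL = {"english": "en", "eng": "en", "en": "en",
             "spanish": "es", "es": "es"}

def standardize(value):
    """Trim whitespace, lowercase, then map to a canonical label."""
    key = value.strip().lower()
    return CANONICAL.get(key, key)  # unknown labels are only trimmed and lowercased

before = Counter(languages)
after = Counter(standardize(v) for v in languages)

print("unique before:", len(before))
print("unique after: ", len(after), dict(after))
```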

---

**🚀 Summary: Why This Inspection Matters**  

✅ **Understand the structure and format of the dataset.**  
✅ **Identify potential formatting issues early.**  
✅ **Ensure each column is properly classified (numeric, categorical, datetime).**  

📌 **Next, we will explore common data issues that EDA can help detect and how to handle them!** 🚀  

---

## 🔹 Common Data Issues EDA Helps Detect  

Before making sense of data, we need to **detect potential issues** that could affect analysis. **EDA helps us clean and prepare data** before diving into complex models or visualizations.  

### **Common Data Issues in Raw Datasets**  
Real-world data is rarely perfect. Some of the most common problems include:  

| **Issue** | **Description** | **Why It’s a Problem** |
|-----------|----------------|------------------------|
| **Missing Data** | Some values are absent or null (`NaN`). | Leads to incomplete analysis or biased results. |
| **Duplicates** | Some rows are repeated. | Can inflate certain patterns or misrepresent trends. |
| **Inconsistent Formatting** | Different formats for dates, categories, or text. | Makes data difficult to merge, filter, or analyze. |
| **Outliers** | Some values are extremely high or low compared to the rest. | May distort averages, correlations, and model accuracy. |
| **Censored or Manipulated Data** | Some data points have been hidden or altered. | Can introduce bias in decision-making. |

---

#### **1️⃣ Identifying & Handling Missing Data**  
Missing data is one of the **biggest challenges** in data analysis.  

**🔍 Why Does Data Go Missing?**
- **Human error** – Data wasn’t collected or recorded properly.  
- **Technical issues** – System failures, incomplete uploads.  
- **Intentional omissions** – Some information wasn’t required or was removed.  

**✅ How Do We Handle Missing Data?** 

| **Scenario** | **Possible Solution** |
|-------------|--------------------|
| Missing completely at random | Ignore or remove affected rows. |
| Missing but predictable (e.g., missing age for infants) | Impute values based on logical assumptions. |
| Large amounts of missing data | Use models or external sources to fill gaps. |

💡 **Example:** If a dataset contains missing timestamps, should we remove those rows or **estimate missing dates** based on known trends?  
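
💻 **Optional preview in code:** A minimal sketch of two of the strategies from the table: removing incomplete rows and imputing a value. Missing values are represented as `None` here, and using the median for imputation is just one reasonable assumption, not a rule.

```python
import statistics

rows = [
    {"user": "a1", "likes": 10, "timestamp": "2023-02-01"},
    {"user": "b2", "likes": None, "timestamp": "2023-02-02"},
    {"user": "c3", "likes": 30, "timestamp": None},
    {"user": "d4", "likes": 50, "timestamp": "2023-02-04"},
]

# Count missing values (None) per column.
missing = {col: sum(r[col] is None for r in rows) for col in rows[0]}
print("missing per column:", missing)

# Option 1: keep only rows with no missing values.
complete = [r for r in rows if all(v is not None for v in r.values())]

# Option 2: fill missing likes with the median of the observed likes.
median_likes = statistics.median(r["likes"] for r in rows if r["likes"] is not None)
imputed = [{**r, "likes": median_likes if r["likes"] is None else r["likes"]}
           for r in rows]

print("rows kept after dropping:", len(complete))
print("imputed likes:", [r["likes"] for r in imputed])
```

Dropping rows shrinks the dataset, while imputing keeps every row but adds an assumption. Which trade-off is acceptable depends on why the data is missing.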

---

#### **2️⃣ Detecting & Removing Duplicates**  
Duplicate rows can appear due to:  
- **Data entry errors**  
- **Multiple downloads of the same dataset**  
- **Merging datasets with overlapping records**  

💡 **Example:** In a dataset of social media posts, duplicate tweets might skew engagement analysis.  

✅ **Handling Duplicates:**  
1. **Check for exact duplicates** (same values in all columns).  
2. **Check for partial duplicates** (same user and timestamp but slightly different text).  
3. **Decide whether to keep or remove duplicates based on context.**  
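
💻 **Optional preview in code:** The first two steps above can be sketched with one helper. `find_duplicates` is a hypothetical function written for this example, and the sample posts are invented.

```python
rows = [
    {"user": "a1", "timestamp": "2023-02-01 10:00", "text": "Big news today"},
    {"user": "a1", "timestamp": "2023-02-01 10:00", "text": "Big news today"},   # exact copy
    {"user": "a1", "timestamp": "2023-02-01 10:00", "text": "Big news today!"},  # near copy
    {"user": "b2", "timestamp": "2023-02-01 11:30", "text": "Hello"},
]

def find_duplicates(rows, key_columns):
    """Return every row whose key columns repeat an earlier row."""
    seen = set()
    duplicates = []
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
    return duplicates

# Step 1: exact duplicates compare all columns.
exact = find_duplicates(rows, ["user", "timestamp", "text"])
# Step 2: partial duplicates compare only user and timestamp.
partial = find_duplicates(rows, ["user", "timestamp"])

print("exact duplicates:  ", len(exact))
print("partial duplicates:", len(partial))
```

The helper only flags candidates. Step 3, deciding whether to keep or remove them, still requires judgment about the context.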

---

#### **3️⃣ Addressing Inconsistent Formatting**  
Datasets often contain **inconsistent formats** that make analysis difficult. These inconsistencies can appear in:  

**🔍 Types of Formatting Issues**

| **Data Type** | **Issue** | **Example** |
|-------------|---------|-------------|
| **Dates** | Different formats used in the same column | `"01-02-2023"`, `"2023/02/01"`, `"Feb 1, 2023"` |
| **Text Data** | Spacing, capitalization, typos, mixed cases | `"New York"`, `"new york"`, `"NYC"` |
| **Categorical Data** | Multiple versions of the same category | `"Male"`, `"M"`, `"male"` |
| **Numerical Data** | Different formats for decimals or thousands | `1,000` vs. `1000.0` |

✅ **How to Fix Formatting Issues**

| **Issue** | **Recommended Fix** |
|----------|---------------------|
| **Inconsistent date formats** | Convert to a standard `YYYY-MM-DD` format. |
| **Case-sensitive inconsistencies** | Standardize to lowercase or title case. |
| **Typos in categories** | Use **string matching** or **group rare categories**. |
| **Decimal & thousands separators** | Ensure uniform numerical formatting. |

💡 **Example:** If a dataset has dates recorded as `"MM/DD/YYYY"` in some rows and `"YYYY-MM-DD"` in others, **sorting by date won't work properly** until it's standardized.  
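
💻 **Optional preview in code:** A sketch of standardizing mixed date formats to `YYYY-MM-DD`. It assumes that only the two formats from the example occur; `to_iso` is a hypothetical helper, and values it cannot parse are returned as `None` so they can be reviewed rather than guessed.

```python
from datetime import datetime

# Assumption: only these two formats appear in the column.
KNOWN_FORMATS = ["%m/%d/%Y", "%Y-%m-%d"]

def to_iso(text):
    """Return the date as 'YYYY-MM-DD', or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

raw = ["02/01/2023", "2023-01-15", "12/25/2022", "not a date"]
cleaned = [to_iso(d) for d in raw]

print("sorted as text:     ", sorted(raw))
print("sorted after fixing:", sorted(d for d in cleaned if d is not None))
```

Sorting the raw strings puts `02/01/2023` before `12/25/2022`, which is chronologically wrong. After conversion, plain text sorting matches date order.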

---

#### **4️⃣ Understanding Outliers & Their Impact**  
An **outlier** is a value that is **significantly higher or lower** than the rest of the data.  

**🔍 Why Do Outliers Matter?**  
❌ **Distort statistical summaries** (mean, standard deviation).  
❌ **Affect visualization scales**, making trends harder to see.  
❌ **Impact machine learning models**, causing inaccurate predictions.  

**✅ How Should We Handle Outliers?**  

| **Scenario** | **Recommended Action** |
|-------------|--------------------|
| **Outlier is a valid extreme case (e.g., viral tweet)** | **Keep it** (it represents real behavior). |
| **Outlier is a data entry mistake** | **Correct or remove it** (e.g., a user with `99999999` followers). |
| **Outlier is affecting statistical modeling** | **Apply a log transformation** or winsorization (capping extreme values at a chosen percentile). |

💡 **Example:** If a handful of social media posts have **millions of likes**, should we analyze them separately from typical posts?  
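
💻 **Optional preview in code:** One common rule of thumb flags values that lie more than 1.5 × IQR beyond the quartiles (the IQR is explained in the section on variability). The sketch below applies that rule and a log transformation to invented `likes` values. The 1.5 multiplier is a convention, not a law, and quartile values can differ slightly between tools.

```python
import math
import statistics

likes = [12, 15, 18, 20, 22, 25, 30, 35, 40, 2_000_000]  # made-up values

# Quartiles: the three cut points that split the data into four parts.
q1, _, q3 = statistics.quantiles(likes, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in likes if x < lower or x > upper]
print("flagged as outliers:", outliers)

# A log transformation compresses the scale without removing the viral post.
log_likes = [math.log10(x) for x in likes]
print("largest value on the log scale:", round(max(log_likes), 2))
```

Flagging a value does not decide what to do with it. A viral post is real behavior, so it may deserve separate analysis rather than deletion.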

---

#### **5️⃣ Recognizing Censored & Manipulated Data**  
Some datasets may contain **intentional omissions or altered values** due to:  
🔹 **Privacy regulations** (e.g., redacted names or locations).  
🔹 **Platform restrictions** (e.g., deleted posts).  
🔹 **Misinformation campaigns** (e.g., bots inflating engagement).  

💡 **Example:** If engagement on a certain topic is **unnaturally high**, could it be the result of **coordinated inauthentic behavior**?  

---

## 🔹 Recognizing Patterns in Data: Statistical Analysis 

Once we have inspected the dataset and addressed basic data issues, the next step in **Exploratory Data Analysis (EDA)** is to **identify patterns** that can provide meaningful insights.  

Statistical patterns help us understand **how data is distributed and how different variables relate to each other**. Recognizing these patterns early can guide our approach to data preprocessing, feature selection, and hypothesis testing.  

---

**📌 Key Statistical Patterns to Identify**

| **Pattern Type** | **What It Reveals** | **Why It’s Important** |
|-----------------|------------------|-----------------------|
| **Central Tendency** | The "typical" value in the dataset (mean, median, mode). | Helps summarize numerical data. |
| **Variability & Spread** | How much the data fluctuates (standard deviation, range, IQR). | Shows if values are consistent or widely scattered. |
| **Skewness & Distribution** | Whether the data is symmetrical or skewed. | Affects which statistical methods and transformations to apply. |
| **Correlations** | The relationship between two variables. | Helps identify dependencies worth investigating (correlation alone does not prove causation). |
| **Trends & Seasonality** | Patterns over time (e.g., daily, weekly, seasonal). | Useful for forecasting and time-series analysis. |

---

### **1️⃣ Understanding Central Tendency (Mean, Median, Mode)**  

Central tendency refers to **where the center of the data lies**. The three main measures of central tendency are:  

| **Measure** | **Description** | **Example Use Case** |
|------------|---------------|------------------|
| **Mean (Average)** | Sum of values divided by count. | Used in engagement metrics like average likes per tweet. |
| **Median** | The middle value when sorted. | Useful when data is skewed (e.g., median income). |
| **Mode** | The most frequently occurring value. | Common for categorical data like most-used languages. |

💡 **Example:** If the **mean number of likes** on social media posts is 500 but the **median is only 150**, this suggests **a small number of posts are going viral**, skewing the average.

---

### **2️⃣ Measuring Variability & Spread**  

While central tendency tells us about the "typical" value, **variability** shows us how **spread out** the data is.  

![iqr](../_static/IQR.png)
[Source & Optional Reading: IQR, The Data School](https://www.thedataschool.co.uk/lex-devlin/basic-statistics-interquartile-range-iqr/)

| **Metric** | **What It Measures** | **Why It’s Useful** |
|-----------|------------------|------------------|
| **Range** | The difference between max and min values. | Shows the full extent of variability. |
| **Interquartile Range (IQR)** | Spread between the 25th and 75th percentile. | Helps detect outliers. |
| **Standard Deviation** | How far values deviate from the mean. | Indicates whether data is concentrated or spread out. |

💡 **Example:** If `likes` has a **high standard deviation**, some posts get **very few likes** while others get **thousands**.
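
💻 **Optional preview in code:** The three metrics from the table, computed for two invented sets of like counts: one steady and one bursty. `spread` is a hypothetical helper written for this sketch.

```python
import statistics

def spread(values):
    """Return range, interquartile range, and sample standard deviation."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "stdev": statistics.stdev(values),
    }

steady = [48, 50, 51, 49, 52, 50, 47, 53]   # made-up: consistent engagement
bursty = [2, 5, 1, 900, 3, 4, 2500, 6]      # made-up: a few posts take off

for name, values in [("steady", steady), ("bursty", bursty)]:
    stats = spread(values)
    print(name, {k: round(v, 1) for k, v in stats.items()})
```

Both lists have eight posts, but the bursty one has a far larger range and standard deviation, so its values are much less consistent.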

---

### **3️⃣ Identifying Skewness & Distribution**  

The **shape of the data distribution** can impact analysis, especially for **numerical features**.  

![stats](../_static/stats.png)
[Source & Optional Reading: Understanding Measures of Central Tendency, Medium](https://medium.com/@nitesh.py/understanding-measures-of-central-tendency-mean-median-and-mode-cabb73175b29)

| **Distribution Type** | **Characteristics** | **Impact on Analysis** |
|----------------------|------------------|------------------|
| **Normal (Symmetrical)** | Mean ≈ Median ≈ Mode. Bell-shaped curve. | Many statistical methods assume normality. |
| **Right-Skewed (Positive Skew)** | Long right tail. Mean > Median. | Common in engagement metrics (most posts get few likes, some go viral). |
| **Left-Skewed (Negative Skew)** | Long left tail. Mean < Median. | Less common but can appear in certain financial or medical data. |

💡 **Example:** Social media engagement metrics like `followers` are **typically right-skewed** because a few influencers have **millions of followers**, while most users have far fewer.
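
💻 **Optional preview in code:** A quick way to check the direction of skew is to compare the mean with the median. The sketch below uses Pearson's median skewness coefficient, 3 × (mean − median) / standard deviation, which is one of several skewness measures. The `followers` values are invented.

```python
import statistics

def pearson_median_skewness(values):
    """Positive means right-skewed, negative left-skewed, zero symmetric."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.stdev(values)

followers = [50, 80, 120, 150, 200, 300, 450, 1_000_000]  # made-up values
symmetric = [1, 2, 3, 4, 5, 6, 7]

print("followers:", round(pearson_median_skewness(followers), 2))
print("symmetric:", round(pearson_median_skewness(symmetric), 2))
```

A positive result matches the right-skewed row of the table (mean > median), while the symmetric list gives zero because its mean and median coincide.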

---

### **4️⃣ Detecting Correlations Between Variables**  

A **correlation** measures how strongly, and in which direction, two variables move together.  

![corr](../_static/correlation.png)
[Source & Optional Reading: Correlation, Medium](https://medium.com/gitgirl/correlation-regression-and-probability-f4c40f94e062)

| **Correlation Type** | **What It Means** | **Example** |
|----------------------|------------------|-------------|
| **Positive Correlation** | When one value increases, the other also increases. | More followers → More likes. |
| **Negative Correlation** | When one value increases, the other decreases. | More spam-like behavior → Fewer shares. |
| **No Correlation** | No clear relationship. | Tweet length and likes may be unrelated. |

💡 **Example:** If `followers` and `likes` have a **strong positive correlation**, it suggests that **more influential users tend to get more engagement**. The correlation by itself does not show that one causes the other.
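
💻 **Optional preview in code:** A sketch of the Pearson correlation coefficient, written out by hand so each step is visible. The result ranges from -1 (perfect negative) through 0 (no linear relationship) to +1 (perfect positive). All three columns are invented, and `spam_score` is a hypothetical measure of spam-like behavior.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # How the two variables vary together, and how each varies on its own.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(var_x * var_y)

followers = [100, 500, 1_000, 5_000, 10_000]  # made-up values
likes = [5, 20, 45, 210, 400]                 # made-up values
spam_score = [9, 7, 6, 3, 1]                  # made-up values

print("followers vs likes:     ", round(pearson_r(followers, likes), 3))
print("followers vs spam_score:", round(pearson_r(followers, spam_score), 3))
```

This coefficient only captures linear relationships, so a scatter plot is still worth checking before drawing conclusions.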

---

### **5️⃣ Recognizing Trends & Seasonality in Time-Based Data**  

If data includes timestamps, we can analyze **trends over time** to detect:  
- **Daily, weekly, or seasonal trends.**  
- **Engagement spikes or declines over time.**  
- **Recurring cycles (e.g., weekend activity vs. weekday activity).**

![time](../_static/time.jpg)
[Source & Optional Reading: Fake News!, U. South Carolina](https://sc.edu/study/colleges_schools/cic/initiatives/social_media_insights_lab/reports/2023/what_does_fake_news_mean.php)

💡 **Example:** The graph above shows the total volume of online conversation around misinformation- and disinformation-related terms, tracking how mentions have evolved over the last 10 years.
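
💻 **Optional preview in code:** Time-based patterns usually start with grouping timestamps into buckets such as days or weekdays. The sketch below counts invented posts per day and compares weekend with weekday activity. The timestamp format is an assumption.

```python
from collections import Counter
from datetime import datetime

timestamps = [  # made-up posting times
    "2023-02-04 09:15", "2023-02-04 21:40", "2023-02-05 10:05",
    "2023-02-06 08:30", "2023-02-08 12:00",
]
parsed = [datetime.strptime(t, "%Y-%m-%d %H:%M") for t in timestamps]

# Daily volume: how many posts fall on each calendar day.
posts_per_day = Counter(p.date().isoformat() for p in parsed)

# weekday() returns 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
weekend_posts = sum(p.weekday() >= 5 for p in parsed)
weekday_posts = len(parsed) - weekend_posts

print(dict(posts_per_day))
print("weekend:", weekend_posts, "weekday:", weekday_posts)
```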

---

## 🔹 Recognizing Patterns in Data: Visualization  

While statistical measures help summarize data numerically, visual representations allow us to quickly spot patterns that might not be obvious from raw numbers alone.
    
**Graphical representations help us:**

✅ **Detect trends, outliers, and distributions** more intuitively.  
✅ **Reveal relationships between variables** that may not be obvious in raw data.  
✅ **Understand categorical and numerical data** more effectively.  

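Each chart type also pairs naturally with one of the statistical checks from the previous section:

- Histograms confirm the distribution shapes suggested by skewness analysis.
- Scatter plots reinforce correlation insights.
- Boxplots highlight the variability and outliers found in IQR analysis.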
  
This section introduces **visualization techniques in order of complexity**, starting with simple frequency plots and moving toward advanced relational and time-series visualizations.  

---

**📌 Key Visualization Techniques** 

| **Chart Type** | **Use Case** | **What It Reveals** |
|--------------|-------------|------------------|
| **Bar Chart** | Comparison of categorical variables. | Shows relative frequencies or counts of categories. |
| **Pie Chart** | Proportional comparison of categories. | Highlights dominant categories but can be misleading with many categories. |
| **Histogram** | Distribution of a single numerical variable. | Shows skewness, normality, gaps, and multi-modal trends. |
| **Boxplot** | Summary of numerical distributions with quartiles. | Highlights outliers and variability. |
| **Density Plot (KDE Plot)** | Smoothed version of a histogram. | Visualizes probability distributions. |
| **Scatter Plot** | Relationship between two numerical variables. | Detects correlations, clusters, and trends. |
| **Line Plot** | Patterns over time. | Identifies time-series trends, seasonality, and spikes. |




---

### **1️⃣ Bar Charts: Comparing Categories (Simple)**
**Best for:** Comparing **counts or averages** of categorical variables.  

![bar](../_static/bar.jpeg)
[Source & Optional Reading: Where exposure to fake news is highest, Statista](https://www.statista.com/chart/14265/where-exposure-to-fake-news-is-highest/)

💡 **Example Use Cases:**  
- Comparing **misinformation vs. fact-based tweets** to see which receives more engagement.  
- Analyzing **top 5 most shared misinformation sources**.  
- Examining **language distribution** in the dataset.  

📌 **What Bar Charts Reveal:**  
✅ Which categories are **dominant** in the dataset.  
✅ **How engagement varies** across groups (e.g., verified vs. unverified accounts).  

---

### **2️⃣ Pie Charts: Showing Proportions (Still Simple but Limited Use)**
**Best for:** Showing **percentages** or proportions of categorical data.  

![pie](../_static/pie.jpg)
[Source & Optional Reading: State of Misinfo, The Trusted Web](https://thetrustedweb.org/state-of-misinformation-2021-united-states//)

💡 **Example Use Cases:**  
- Proportion of **misinformation vs. reliable sources**.  
- Distribution of **content types** (text-only, images, videos).  

📌 **When to Use Pie Charts:**  
✅ When there are **few categories (≤5)**.  
❌ Avoid when too many slices make it hard to compare proportions.

---

### **3️⃣ Histograms: Understanding Data Distributions**  
**Best for:** Analyzing the **spread of numerical data** by grouping values into bins.  

![hist](../_static/hist.png)
[Source & Optional Reading: The Fake News Effect, Thaler](https://www.researchgate.net/figure/Histogram-of-Perceived-Veracity-of-True-and-Fake-News-on-Politicized-Topics_fig2_346614752)

💡 **Example Use Cases:**  
- Checking if **likes, shares, and comments** follow a normal distribution.  
- Identifying **skewed engagement patterns** in viral posts.  

📌 **What Histograms Reveal:**  
✅ Whether a dataset is **normally distributed, skewed, or multimodal**.  
✅ **Gaps in the data** (e.g., engagement is never exactly 50 but jumps from 49 to 51).  
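
💻 **Optional preview in code:** A histogram is just a count of how many values fall into each bin. The sketch below draws a text-only version with the standard library (later lessons use Matplotlib for real charts). The `likes` values and the bin width of 20 are arbitrary choices for illustration.

```python
from collections import Counter

likes = [3, 7, 12, 15, 18, 22, 25, 31, 44, 95]  # made-up values
BIN_WIDTH = 20

# Map each value to the lower edge of its bin, then count values per bin.
bins = Counter((x // BIN_WIDTH) * BIN_WIDTH for x in likes)

for start in range(0, max(likes) + 1, BIN_WIDTH):
    label = f"{start}-{start + BIN_WIDTH - 1}"
    print(f"{label:>6} | {'#' * bins[start]}")
```

Most values land in the first bins and a single value sits far to the right, which is the shape of a right-skewed distribution. The empty bin in between is the kind of gap a histogram makes visible.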

---

### **4️⃣ Boxplots: Identifying Outliers and Variability**  
**Best for:** Comparing distributions across categories & detecting outliers.  

![boxplot](../_static/boxplot.png)
[Source & Optional Reading: Can citizen pressure influence politicians’ communication about climate change?, Wynes](https://www.researchgate.net/figure/Boxplot-showing-the-percentage-of-all-of-an-MPs-tweets-which-were-coded-as-pro-climate_fig1_354643823)

💡 **Example Use Cases:**  
- Comparing **engagement levels between misinformation vs. fact-based content**.  
- Checking for **extreme outliers in shares and comments**.  

📌 **What Boxplots Reveal:**  
✅ **Outliers** that may indicate **viral posts or spam activity**.  
✅ The **spread of engagement** (e.g., some sources consistently receive higher interaction).  

---

### **5️⃣ Scatter Plots: Finding Relationships Between Two Variables**  
**Best for:** Detecting **correlations, clusters, and anomalies** between two numerical variables.  

![scatter](../_static/scatter.jpg)  
[Source & Optional Reading: Social Media Analysis with AI, Khan](https://www.researchgate.net/figure/Positive-Tweets-Scatter-plot-Represent-the-distribution-of-tweets-classified-as_fig4_343685163)

💡 **Example Use Cases:**  
- Checking if **accounts with more followers** get **more likes** (correlation).  
- Identifying **clusters of bot-like behavior** (e.g., many tweets but low engagement).  

📌 **What Scatter Plots Reveal:**  
✅ **Strong or weak relationships** between variables.  
✅ **Clusters of similar data points** (which might indicate distinct behavioral groups).  

---

### **6️⃣ Line Plots: Tracking Trends Over Time**  
**Best for:** Showing **patterns over time**, detecting trends and seasonality.  

![time2](../_static/line2.jpg)
[Source & Optional Reading: Fake News!, U. South Carolina](https://sc.edu/study/colleges_schools/cic/initiatives/social_media_insights_lab/reports/2023/what_does_fake_news_mean.php)

💡 **Example Use Cases:**  
- Tracking **spikes in misinformation engagement** over election cycles.  
- Analyzing **daily vs. weekly posting behavior** of misinformation sources.  

📌 **What Line Plots Reveal:**  
✅ **Time-based trends**, such as daily engagement cycles.  
✅ **Recurring misinformation waves**, like coordinated amplification.  

---

**🚀 Looking Ahead: More Complex Visualizations**

Beyond the standard visualizations covered here, more advanced techniques are useful for analyzing **misinformation networks, amplification patterns, and content spread**. In later lessons, we will introduce:

🔹 **Heatmaps** – To visualize whether negative sentiment correlates with higher engagement.  
🔹 **Sankey Diagrams** – To visualize how misinformation flows between platforms.  
🔹 **Network Graphs** – To uncover how misinformation spreads between accounts.  
🔹 **Word Clouds** – To highlight the most commonly used words in misinformation narratives.  

---

## 🔹 Key Takeaways from EDA  

In this lesson, we explored the **foundations of Exploratory Data Analysis (EDA)** and its critical role in understanding datasets before applying statistical models or drawing conclusions.  

✅ **Statistical Patterns Help Uncover Insights** – Measures like **central tendency, variability, skewness, and correlations** allow us to interpret data meaningfully.  
✅ **Visualizations Make Data More Accessible** – Graphical representations, from **bar charts and histograms to scatter plots and line plots**, reveal hidden trends and relationships.  
✅ **Time-Based and Categorical Analysis Expose Key Trends** – Understanding **when and how misinformation spreads** can help identify amplification networks and engagement cycles.  
✅ **Misinformation Analysis Requires a Nuanced Approach** – EDA helps us **detect potential manipulation, identify high-risk narratives, and ensure data integrity** before making decisions. 

---

**📊 From Theory to Practice: Next Steps**  
Now that we've built a **conceptual foundation** in EDA, it's time to **put theory into action**. In the next lesson, we will introduce **Python basics**—the essential tools you’ll need for **loading, inspecting, and analyzing data programmatically**.  

---

**🔗 What’s Next: Python for Data Analysis**  
In the upcoming Python Basics lesson, we will:  
🚀 **Introduce Python programming concepts relevant to data analysis.**  
📊 **Work with libraries like Pandas, NumPy, and Matplotlib for real-world analysis.**  
🔍 **Load and explore a dataset to apply EDA techniques hands-on.**  

With this knowledge, you’ll be ready for the **full guided Python walkthrough**, where we’ll explore **patterns in an actual dataset**, mirroring the concepts from this lesson. 

---

📌 **Let’s dive into Python and start coding!** 🐍💻  


[Provide Anonymous Feedback on this Lesson Here](https://forms.gle/4ZRmNr5rmGCAR1Re6)