# Table of Contents
<li><a href="#Initial_exploration">Initial_exploration</a></li>
<li><a href="#Data_validation">Data_validation</a></li>
<li><a href="#Data_summarization">Data_summarization</a></li>
<li><a href="#Addressing_missing_data">Addressing_missing_data</a></li>
<li><a href="#Converting_and_analyzing_categorical_data">Converting_and_analyzing_categorical_data</a></li>
<li><a href="#Working_with_numeric_data">Working_with_numeric_data</a></li>
<li><a href="#Handling_outliers">Handling_outliers</a></li>
<li><a href="#Patterns_over_time">Patterns_over_time</a></li>
<li><a href="#Correlation">Correlation</a></li>
<li><a href="#Factor_relationships_and_distributions">Factor_relationships_and_distributions</a></li>
<li><a href="#Considerations_for_categorical_data">Considerations_for_categorical_data</a></li>
<li><a href="#Generating_new_features">Generating_new_features</a></li>
<li><a href="#Generating_hypotheses">Generating_hypotheses</a></li>

<a id='Initial_exploration'></a>
# Initial_exploration

Welcome to **EDA**! Before we apply machine learning or statistical analysis, we need to **understand our dataset**.  

---

## 1️⃣ What is Exploratory Data Analysis (EDA)?  
EDA is the process of **cleaning, summarizing, and visualizing data** to:  
✔ Identify **missing values** & **data types**.  
✔ Compute **descriptive statistics**.  
✔ Detect **patterns & correlations**.  
✔ Generate **hypotheses** for deeper analysis.  

📌 **Example:** Given a **books dataset**, we can explore:  
- **How many genres are represented?**  
- **What is the average book rating?**  
- **Are there any missing values?**  

---

## 2️⃣ Loading the Data  
Let's start by importing Pandas and loading a **CSV file**:  

```python
import pandas as pd  

# Load the dataset
books = pd.read_csv("books.csv")

# Display the first 5 rows
print(books.head())
```

👀 **What to look for?**  
✔ **Column names** (e.g., "title", "author", "genre", "rating").  
✔ **General structure** of the dataset.  

---

## 3️⃣ Checking Data Types & Missing Values  
The `.info()` method gives a quick overview:  

```python
books.info()  # info() prints its summary directly and returns None
```

✔ **Column names & data types**  
✔ **Non-null values (missing data check)**  
✔ **Memory usage**  

---

## 4️⃣ Understanding Categorical Columns  
We often want to know **how many data points exist in each category**.  
For example, to check **genre distribution**:  

```python
print(books["genre"].value_counts())
```

📌 This helps us understand **which categories dominate the dataset**.  

---

## 5️⃣ Understanding Numerical Columns  
Use `.describe()` to get summary statistics:  

```python
print(books.describe())
```

🔹 **Key statistics provided:**  
✔ **Count** → Number of non-null values.  
✔ **Mean** → Average value.  
✔ **Standard Deviation** → Spread of the data.  
✔ **Min & Max** → Range of values.  
✔ **Quartiles (25%, 50%, 75%)** → Distribution insights.  

---

## 6️⃣ Visualizing Data with Histograms 📊  
Histograms help us understand **how numerical values are distributed**.  

```python
import seaborn as sns  
import matplotlib.pyplot as plt  

# Create a histogram of book ratings
sns.histplot(data=books, x="rating")

# Show the plot
plt.show()
```

👀 **What to look for?**  
✔ **Where is the data concentrated?**  
✔ **Are there outliers?**  

---

## 7️⃣ Adjusting Bin Width  
To **increase granularity**, set `binwidth=0.1`:  

```python
sns.histplot(data=books, x="rating", binwidth=0.1)
plt.show()
```

✅ **Smaller bins → Finer detail**, although very narrow bins can make the plot look noisy.  

---

## 8️⃣ Let’s Practice! 🚀  
Now, try applying these techniques to **a new dataset**:  
✔ **Load the dataset**  
✔ **Check for missing values**  
✔ **Explore categorical & numerical columns**  
✔ **Visualize key distributions**  

EDA is the **first step** in making data-driven decisions—master it! 🎯  

![image.png](attachment:3b80c59d-8bf2-4188-823c-5bd340c2fc56.png)

In [None]:
# Print the first five rows of unemployment
print(unemployment.head())

In [None]:
# Print a summary of non-missing values and data types in the unemployment DataFrame
unemployment.info()  # info() prints its summary directly and returns None

![image.png](attachment:d1135dde-9153-446d-b811-995b082e04ff.png)

In [None]:
# Count the values associated with each continent in unemployment
print(unemployment['continent'].value_counts())

![image.png](attachment:92276dd3-b295-411e-8ca2-a0b4436c8a1f.png)

In [None]:
# Import the required visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Create a histogram of 2021 unemployment; show a full percent in each bin
sns.histplot(x='2021', data=unemployment, binwidth=1)
plt.show()

<a id='Data_validation'></a>
# Data_validation

Before diving deep into analysis, we need to **validate our data** to ensure:  
✔ **Data types are correct**  
✔ **Categorical values match expected categories**  
✔ **Numerical values fall within expected ranges**  

---

## 1️⃣ Validating Data Types  

We use `.info()` to check **column types & missing values**:  

```python
books.info()  # info() prints its summary directly and returns None
```

🔹 Alternatively, to check **only data types**:  

```python
print(books.dtypes)
```

---

## 2️⃣ Updating Data Types  

Sometimes, columns are stored in the wrong format. For example, **years should be integers**:  

```python
books["year"] = books["year"].astype(int)
print(books.dtypes)
```

📌 **Common data types & their Python names:**  
- `int` → Whole numbers  
- `float` → Decimal numbers  
- `str` → Text data  
- `bool` → True/False values  
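
🔹 **When the column has missing values:** `.astype(int)` raises an error if any entry is missing. One possible workaround, sketched here on a toy `books` DataFrame invented for illustration, is to convert with `pd.to_numeric()` and then use pandas' nullable `Int64` type:

```python
import pandas as pd

# Hypothetical toy data: years stored as text, one of them missing
books = pd.DataFrame({"year": ["2001", "2015", None]})

# errors="coerce" turns anything that cannot be parsed into a missing value
years = pd.to_numeric(books["year"], errors="coerce")

# The nullable Int64 dtype stores whole numbers alongside missing values
books["year"] = years.astype("Int64")
print(books.dtypes)
```

Whether a nullable integer or a plain float is the better fit depends on what later steps expect, so treat this as one option rather than the rule.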

---

## 3️⃣ Validating Categorical Data  

To check if values in a column match **expected categories**, use `.isin()`:  

```python
valid_genres = ["Fiction", "Non Fiction"]
print(books["genre"].isin(valid_genres))
```

🔍 This returns `True` or `False` for each row.  

🔹 To **invert** the results (show invalid values):  

```python
print(books[~books["genre"].isin(valid_genres)])
```

🔹 To **filter only valid rows**:  

```python
valid_books = books[books["genre"].isin(valid_genres)]
```
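
🔹 To see **which unexpected labels** are present, rather than just which rows fail, the inverted mask can be combined with `.unique()`. A minimal sketch, assuming a small made-up `books` table:

```python
import pandas as pd

# Hypothetical toy data containing two labels outside the expected list
books = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genre": ["Fiction", "Non Fiction", "Childrens", "fiction"],
})
valid_genres = ["Fiction", "Non Fiction"]

# True where the genre is one of the expected categories
is_valid = books["genre"].isin(valid_genres)

# Distinct labels that failed the check (note that matching is case-sensitive)
unexpected = sorted(books.loc[~is_valid, "genre"].unique())
print(unexpected)
```

A lowercase `"fiction"` is flagged here, which often points to a cleaning step (such as fixing capitalization) rather than a truly new category.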

---

## 4️⃣ Validating Numerical Data  

To select **only numerical columns**:  

```python
numerical_data = books.select_dtypes(include="number")
print(numerical_data.head())
```

📌 **Checking numerical ranges**:  

```python
print("Min Year:", books["year"].min())
print("Max Year:", books["year"].max())
```
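
🔹 To **flag rows outside an expected range**, `.between()` returns a boolean mask. The bounds below (1900 to 2024) and the toy data are assumptions made purely for illustration:

```python
import pandas as pd

# Hypothetical toy data with one implausible publication year
books = pd.DataFrame({
    "title": ["A", "B", "C"],
    "year": [1998, 2015, 2105],
})

# between() is inclusive on both ends by default
in_range = books["year"].between(1900, 2024)

# Rows that fall outside the assumed plausible range
out_of_range = books[~in_range]
print(out_of_range)
```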

---

## 5️⃣ Visualizing Numerical Data  

Use a **boxplot** to detect **outliers & distributions**:  

```python
import seaborn as sns  
import matplotlib.pyplot as plt  

sns.boxplot(data=books, x="year")
plt.show()
```

📊 **Interpreting the Boxplot:**  
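✔ **Whisker ends** → The smallest & largest values within 1.5 × IQR of the box; points beyond the whiskers are drawn individually as outliers  
✔ **Box edges (25th & 75th Percentile)** → The middle 50% of data lies between them  
✔ **Line inside the box (Median)** → The middle value  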

🔹 **Grouping by categories**:  

```python
sns.boxplot(data=books, x="genre", y="year")
plt.show()
```

---

## 🎯 Let’s Practice!  

Now, apply these validation steps to **a new dataset**:  
✔ **Check & update data types**  
✔ **Validate categorical values**  
✔ **Check numerical ranges**  
✔ **Use visualizations to spot issues**  

Ensuring **clean data** leads to **more reliable insights**! 🚀  

![image.png](attachment:f6f88571-af20-4b67-aaf2-52cd2006c969.png)

In [None]:
# Update the data type of the 2019 column to a float
unemployment["2019"] = unemployment["2019"].astype(float)
# Print the dtypes to check your work
print(unemployment.dtypes)

![image.png](attachment:e705fd97-c6e7-413f-ad28-bdcccff80a3f.png)

In [None]:
# Define a Series describing whether each continent is outside of Oceania
not_oceania = ~unemployment["continent"].isin(["Oceania"])

# Print unemployment without records related to countries in Oceania
print(unemployment[not_oceania])

![image.png](attachment:a8f90c90-f076-4d84-98a8-006b1c3c2a80.png)

In [None]:
# Print the minimum and maximum unemployment rates during 2021
print(unemployment['2021'].min(), unemployment['2021'].max())

# Create a boxplot of 2021 unemployment rates, broken down by continent
sns.boxplot(x='2021', y='continent', data=unemployment)
plt.show()

<a id='Data_summarization'></a>
# Data_summarization


After validating our data, it's time to **summarize key insights** by:  
✔ **Grouping data** to compare categories  
✔ **Applying aggregations** for numerical summaries  
✔ **Visualizing categorical comparisons**  

---

## 1️⃣ Grouping Data with `.groupby()`  

We use `.groupby()` to summarize numerical data within each **category**:  

```python
books.groupby("genre").mean(numeric_only=True)  # skip text columns such as title
```

🔍 This computes the **average** for all numerical columns within each genre.  

🔹 **Other aggregation functions:**  
- `.sum()` → Total sum  
- `.count()` → Number of values  
- `.min()` / `.max()` → Min & max values  
- `.var()` → Variance  
- `.std()` → Standard deviation  
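
A small worked example may help here. It is a sketch on made-up data, and it assumes a recent pandas version in which text columns have to be excluded explicitly (via `numeric_only=True` or by selecting a column first):

```python
import pandas as pd

# Hypothetical toy data for illustration
books = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction"],
    "rating": [4.0, 4.4, 4.6, 4.2],
})

# numeric_only=True leaves the text column "title" out of the calculation
genre_means = books.groupby("genre").mean(numeric_only=True)
print(genre_means)

# Selecting one column first restricts the summary to that column
rating_spread = books.groupby("genre")["rating"].agg(["min", "max"])
print(rating_spread)
```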

---

## 2️⃣ Aggregating Ungrouped Data  

To compute **multiple summary statistics** at once:  

```python
books.select_dtypes(include="number").agg(["mean", "std"])
```

📌 **These statistics are only defined for numeric columns**; recent pandas versions raise an error if text columns are included.  

---

## 3️⃣ Specifying Aggregations for Each Column  

We can **control which aggregations** apply to **which columns**:  

```python
books.agg({"rating": ["mean", "std"], "year": "median"})
```

---

## 4️⃣ Named Summary Columns  

To apply multiple aggregations **within groups** and give **custom column names**:  

```python
books.groupby("genre").agg(
    rating_mean=("rating", "mean"),
    rating_std=("rating", "std"),
    median_year=("year", "median")
)
```

🔹 **Output Example:**  

| Genre       | Rating Mean | Rating Std | Median Year |
|-------------|-------------|------------|-------------|
| Fiction     | 4.2         | 0.3        | 2012        |
| Non Fiction | 4.5         | 0.2        | 2015        |

---

## 5️⃣ Visualizing Categorical Summaries  

A **Seaborn barplot** quickly shows the **mean rating by genre**:  

```python
import seaborn as sns
import matplotlib.pyplot as plt  

sns.barplot(data=books, x="genre", y="rating")
plt.show()
```

📊 **Interpreting the Barplot:**  
✔ **Bar height** → Average rating  
✔ **Vertical line on top** → 95% confidence interval  
✔ **Longer error bars** → More uncertainty in that genre's mean (fewer books or more varied ratings)  

---

## 🎯 Let’s Practice!  

Now, apply these summarization techniques to **a new dataset**:  
✔ **Group & aggregate data**  
✔ **Customize aggregations per column**  
✔ **Use visualizations for better insights**  

Summarizing data **helps uncover trends** and **guides further analysis**! 🚀  

![image.png](attachment:5f556853-0eab-44ab-82b1-5a4306fffaae.png)

In [None]:
# Print the mean and standard deviation of rates by year (numeric columns only)
print(unemployment.select_dtypes(include="number").agg(['mean', 'std']))

In [None]:
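# Print yearly mean and standard deviation grouped by continent
# Keep only the numeric (yearly) columns, since recent pandas versions raise on text columns
yearly = unemployment.select_dtypes(include="number")
print(yearly.groupby(unemployment['continent']).agg(['mean', 'std']))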

![image.png](attachment:5c9a160c-d0c3-48e6-862f-d8b2f47dffff.png)

In [None]:
continent_summary = unemployment.groupby("continent").agg(
    # Create the mean_rate_2021 column
    mean_rate_2021=('2021', 'mean'),
    # Create the std_rate_2021 column
    std_rate_2021=('2021', 'std')
)
print(continent_summary)

![image.png](attachment:1b2c8276-3354-4d5e-af74-1eedb7b55a87.png)

In [None]:
# Create a bar plot of continents and their average unemployment
sns.barplot(x='continent', y='2021', data=unemployment)
plt.show()

<a id='Addressing_missing_data'></a>
# Addressing_missing_data

Missing data can **skew analysis** and lead to **misleading insights**.  
This guide covers:  
✔ **Detecting missing values**  
✔ **Strategies to handle missing data**  
✔ **Using Pandas functions for imputation & filtering**  

---

## 1️⃣ Why is Missing Data a Problem?  

Missing values can affect:  
✔ **Distributions** → Biased means & incorrect conclusions  
✔ **Correlations** → Distorted relationships between variables  
✔ **Predictive models** → Reduces accuracy  

Example: If the **tallest students' heights** are missing, the computed **average height** will be **lower than the true average**.  

---

## 2️⃣ Checking for Missing Values  

To **count missing values** per column:  

```python
salaries.isna().sum()
```

🔍 **Example Output:**

| Column         | Missing Values |
|---------------|----------------|
| Salary_USD   | 60             |
| Job Title    | 10             |

---

## 3️⃣ Strategies for Addressing Missing Data  

✅ **Rule of Thumb:**  
✔ **Drop** missing rows if **≤5% of total data**  
✔ **Impute** missing values if **>5%** using:  
   - **Mean** (for normally distributed data)  
   - **Median** (for skewed data)  
   - **Mode** (for categorical data)  
✔ **Group-based Imputation** → Fill missing values using **median per group**  
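
The rule of thumb can be sketched in code as follows. The toy data and the helper `suggest()` are hypothetical, invented here only to show how the 5% cut-off might be applied column by column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 40 rows with a known share of gaps
salaries = pd.DataFrame({
    "Salary_USD": [60000.0] * 40,
    "Job Title": ["Analyst"] * 40,
})
salaries.loc[:7, "Salary_USD"] = np.nan  # .loc slices are inclusive: 8 of 40 rows (20%)
salaries.loc[0, "Job Title"] = np.nan    # 1 of 40 rows (2.5%)

# The mean of a boolean mask is the fraction of missing values per column
missing_share = salaries.isna().mean()

def suggest(share):
    """Apply the 5% rule of thumb to one column's missing share."""
    if share == 0:
        return "nothing to do"
    return "drop rows" if share <= 0.05 else "impute"

decision = missing_share.apply(suggest)
print(decision)
```

The 5% figure is a convention rather than a hard rule, so it is worth adjusting to the size of the dataset and the importance of the column.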

---

## 4️⃣ Dropping Missing Values  

If a column has missing values in **≤5% of rows**, we **drop the affected rows** (the column itself is kept):

```python
threshold = len(salaries) * 0.05  # 5% threshold
cols_to_drop = salaries.isna().sum()[salaries.isna().sum() <= threshold].index
salaries.dropna(subset=cols_to_drop, inplace=True)
```

---

## 5️⃣ Imputing Categorical Data  

For categorical columns, we **replace missing values with mode**:

```python
for col in ["Job Title", "Employment Type", "Company Size"]:
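    # Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
    salaries[col] = salaries[col].fillna(salaries[col].mode()[0])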
```

---

## 6️⃣ Imputing Numerical Data by Group  

When **salary varies by experience level**, we impute **median salary per experience group**:

```python
median_salaries = salaries.groupby("Experience")["Salary_USD"].median().to_dict()
# Assign the result back instead of using inplace=True on a selected column
salaries["Salary_USD"] = salaries["Salary_USD"].fillna(salaries["Experience"].map(median_salaries))
```

📌 **Why this works?**  
- Groups salaries **by experience level**  
- Computes **median salary for each group**  
- Uses `.map()` to **fill missing values accordingly**  
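
An alternative sketch of the same idea uses `.transform("median")`, which returns one group median per row and so can be passed to `.fillna()` directly. The toy data is invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing salary in each experience group
salaries = pd.DataFrame({
    "Experience": ["Entry", "Entry", "Entry", "Senior", "Senior", "Senior"],
    "Salary_USD": [50000.0, 54000.0, np.nan, 120000.0, np.nan, 130000.0],
})

# One median per row, aligned with the original index (missing values are skipped)
group_medians = salaries.groupby("Experience")["Salary_USD"].transform("median")

# Fill each gap with the median of that row's own group
salaries["Salary_USD"] = salaries["Salary_USD"].fillna(group_medians)
print(salaries)
```

Both approaches should give the same result; the dictionary version makes the lookup table explicit, while `transform` skips the intermediate step.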

---

## 7️⃣ Final Check: No More Missing Values!  

To verify:  

```python
salaries.isna().sum()
```

---

## 🎯 Let's Practice!  

Try these techniques on a new dataset:  
✔ Identify missing values  
✔ Decide whether to **drop or impute**  
✔ Apply different **imputation strategies**  

Handling missing data **ensures reliable analysis & accurate models!** 🚀  


![image.png](attachment:5a12f7b7-444d-4fd9-8795-d42449b400a5.png)

In [None]:
# Count the number of missing values in each column
print(planes.isna().sum())

# Find the five percent threshold
threshold = len(planes) * 0.05

# Create a filter
cols_to_drop = planes.columns[planes.isna().sum() <= threshold]

# Drop missing values for columns below the threshold
planes.dropna(subset=cols_to_drop, inplace=True)

print(planes.isna().sum())

![image.png](attachment:feefe9e0-9475-4537-8365-0e7d9db1ae6c.png)

In [None]:
# Check the values of the Additional_Info column
print(planes["Additional_Info"].value_counts())

# Create a box plot of Price by Airline
sns.boxplot(data=planes, x='Airline', y='Price')

plt.show()

![image.png](attachment:12f9421a-bf36-4ccb-86bb-7b5fa3aedfa6.png)

In [None]:
# Calculate median plane ticket prices by Airline
airline_prices = planes.groupby("Airline")["Price"].median()

print(airline_prices)

# Convert to a dictionary
prices_dict = airline_prices.to_dict()

# Map the dictionary to missing values of Price by Airline
print(planes["Airline"].map(prices_dict))
planes["Price"] = planes["Price"].fillna(planes["Airline"].map(prices_dict))

# Check for missing values
print(planes.isna().sum())

<a id='Converting_and_analyzing_categorical_data'></a>
# Converting_and_analyzing_categorical_data

Categorical data, such as **job titles, experience levels, and company sizes**, can be analyzed by:  
✔ **Counting unique values**  
✔ **Extracting patterns from text**  
✔ **Creating new categorical columns**  
✔ **Visualizing category distributions**  

---

## 1️⃣ Previewing Categorical Data  

We can **select only the non-numeric columns** using `.select_dtypes()`, then preview them with `.head()`:  

```python
salaries.select_dtypes(exclude="number").head()
```

Example output (first 5 rows):

| Designation        | Experience | Employment_Status | Company_Size |
|--------------------|------------|-------------------|--------------|
| Data Scientist    | Senior     | Full-Time        | Large        |
| ML Engineer       | Mid-Level  | Contract        | Medium       |

---

## 2️⃣ Counting Unique Values in a Column  

To check the **number of unique job titles**:  

```python
salaries["Designation"].nunique()
```

🔍 Example output: `50` (the number of unique job titles)  

---

## 3️⃣ Extracting Value from Categories  

We can filter job titles containing **specific keywords** using `.str.contains()`:  

```python
salaries["Designation"].str.contains("Scientist")
```

🔍 Returns **True/False** values for rows where "Scientist" is present.  

---

## 4️⃣ Searching for Multiple Keywords  

We can **search for multiple phrases** using a **pipe (`|`) operator**:  

```python
salaries["Designation"].str.contains("Machine Learning|AI")
```

📌 **Important:** Don't put spaces around `|`. In `"ML | AI"` the spaces become part of each pattern, so it only matches `"ML "` or `" AI"`.  

To filter job titles **starting with** a word like "Data", use `^`:  

```python
salaries["Designation"].str.contains("^Data")
```

✅ Matches **"Data Scientist"**  
❌ Does not match **"Big Data Engineer"**  
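
The patterns above can be checked on a handful of made-up titles. This is a small sketch; `na=False` is an added assumption so that missing titles count as "no match" instead of producing missing values in the mask:

```python
import pandas as pd

# Hypothetical job titles, including one missing value
titles = pd.Series([
    "Data Scientist",
    "Big Data Engineer",
    "AI Researcher",
    "Machine Learning Engineer",
    None,
])

# ^ anchors the match to the start of the string
starts_with_data = titles.str.contains("^Data", na=False)

# | means "either pattern"
ml_or_ai = titles.str.contains("Machine Learning|AI", na=False)

# With spaces around |, the spaces are part of the patterns being searched for
spaced = titles.str.contains("Machine Learning | AI", na=False)

print(starts_with_data.tolist())
print(ml_or_ai.tolist())
print(spaced.tolist())
```

In the spaced version, `"AI Researcher"` is no longer matched because the pattern now requires a space before `AI`.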

---

## 5️⃣ Creating a List of Job Categories  

We define **job categories** based on **common keywords**:  

```python
data_science = "Data Scientist|NLP"
data_analyst = "Analyst|Analytics"
data_engineer = "Data Engineer"
ml_engineer = "Machine Learning Engineer"
managerial = "Manager|Lead"
consultant = "Consultant"
```

We store these **filters in a list**:  

```python
job_categories = [
    data_science, data_analyst, data_engineer,
    ml_engineer, managerial, consultant
]
```

---

## 6️⃣ Creating a Categorical Column  

We use **NumPy’s `select()` function** to classify job roles:  

```python
import numpy as np

conditions = [
    salaries["Designation"].str.contains(data_science),
    salaries["Designation"].str.contains(data_analyst),
    salaries["Designation"].str.contains(data_engineer),
    salaries["Designation"].str.contains(ml_engineer),
    salaries["Designation"].str.contains(managerial),
    salaries["Designation"].str.contains(consultant)
]

categories = ["Data Science", "Data Analyst", "Data Engineer", 
              "ML Engineer", "Managerial", "Consultant"]

salaries["Job_Category"] = np.select(conditions, categories, default="Other")
```

📌 **Why use `np.select()`?**  
✔ Matches job titles to categories **efficiently**  
✔ Assigns **"Other"** if no match is found  

---

## 7️⃣ Previewing the New Column  

To **check correctness**:  

```python
salaries[["Designation", "Job_Category"]].head()
```

Example output:

| Designation               | Job_Category  |
|---------------------------|---------------|
| Data Scientist            | Data Science  |
| Machine Learning Engineer | ML Engineer   |
| Big Data Engineer         | Data Engineer |

✅ **Looks good!**  

---

## 8️⃣ Visualizing Job Category Distribution  

To **plot category frequency** using Seaborn:  

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(data=salaries, x="Job_Category")
plt.xticks(rotation=45)
plt.show()
```

🔍 **Insights:**  
- Most jobs fall under **Data Science, Engineer, and Analyst** roles  
- Very few jobs were classified as **Other**, meaning the keyword lists **cover most job titles**  

---

## 🎯 Let's Practice!  

✔ Try **extracting job roles** from a **new dataset**  
✔ **Define** your own categories  
✔ **Visualize** the frequency of categories  

This **enhances data quality** and **enables better analysis!** 🚀  

![image.png](attachment:8437aa3b-3772-4f16-8cfb-e351fa68c99e.png)

In [None]:
# Filter the DataFrame for object columns
non_numeric = planes.select_dtypes("object")

# Loop through columns
for col in non_numeric.columns:
  
  # Print the number of unique values
  print(f"Number of unique values in {col} column: ", non_numeric[col].nunique())

![image.png](attachment:c55c9284-94f7-465a-9db4-a7686cf713dd.png)

In [None]:
# Create a list of categories
flight_categories = ["Short-haul", "Medium", "Long-haul"]

# Create short_flights
short_flights = "^0h|^1h|^2h|^3h|^4h"

# Create medium_flights
medium_flights = "^5h|^6h|^7h|^8h|^9h"

# Create long_flights
long_flights = "10h|11h|12h|13h|14h|15h|16h"

![image.png](attachment:b8516691-d8c0-417e-b395-3fb931c6fa9d.png)

In [None]:
# Create conditions for values in flight_categories to be created
conditions = [
    (planes["Duration"].str.contains(short_flights)),
    (planes["Duration"].str.contains(medium_flights)),
    (planes["Duration"].str.contains(long_flights))
]

# Apply the conditions list to the flight_categories
planes["Duration_Category"] = np.select(conditions, 
                                        flight_categories,
                                        default="Extreme duration")

# Plot the counts of each category
sns.countplot(data=planes, x="Duration_Category")
plt.show()

<a id='Working_with_numeric_data'></a>
# Working_with_numeric_data


Numeric data allows us to:  
✔ **Clean & convert values**  
✔ **Perform currency conversions**  
✔ **Calculate summary statistics**  
✔ **Add insights to the DataFrame**  

---

## 1️⃣ Loading the Salaries Dataset  

Let's print a **summary of the dataset**:  

```python
salaries.info()
```

🔍 **Observation:**  
- No `Salary_USD` column  
- Instead, there's a `Salary_In_Rupees` column  

---

## 2️⃣ Cleaning Salary Data  

### 🔹 Issue: Salary values contain **commas** & are stored as **strings**  
### 🔹 Solution: Remove commas & convert to **float**  

```python
salaries["Salary_In_Rupees"] = salaries["Salary_In_Rupees"].str.replace(",", "").astype(float)
```

✅ **No more commas!**  

---

## 3️⃣ Converting Rupees to USD  

We apply the **conversion rate**:  

```python
conversion_rate = 0.012  # 1 INR = 0.012 USD
salaries["Salary_USD"] = salaries["Salary_In_Rupees"] * conversion_rate
```

### 🔍 Checking Conversion  
```python
salaries[["Salary_In_Rupees", "Salary_USD"]].head()
```

📌 **Salary in USD** is **1.2% of INR salary**  

---

## 4️⃣ Calculating Mean Salary by Company Size  

Using `groupby()` to **get mean salary per company size**:  

```python
salaries.groupby("Company_Size")["Salary_USD"].mean()
```

---

## 5️⃣ Adding Summary Statistics  

### Standard Deviation of Salary **by Experience**  

```python
salaries["Salary_StdDev"] = salaries.groupby("Experience")["Salary_USD"].transform(lambda x: x.std())
```

### 🔍 Checking Results  
```python
salaries[["Experience", "Salary_StdDev"]].value_counts()
```

📌 **Senior-level (SE) employees have the highest salary variation!**  

---

## 6️⃣ Adding Median Salary by Company Size  

```python
salaries["Median_Salary"] = salaries.groupby("Company_Size")["Salary_USD"].transform(lambda x: x.median())
```

✅ **Medium-sized companies have the highest median salaries!**  
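
To make the difference between aggregating and transforming concrete, here is a short sketch on invented data. A plain `.median()` returns one value per group, while `.transform()` returns one value per row so it can be stored as a new column (the string `"median"` is used here as a shorthand for the lambda):

```python
import pandas as pd

# Hypothetical toy data: two small companies, three medium ones
salaries = pd.DataFrame({
    "Company_Size": ["S", "S", "M", "M", "M"],
    "Salary_USD": [40000.0, 60000.0, 70000.0, 90000.0, 110000.0],
})

# Aggregation: one row per group
per_group = salaries.groupby("Company_Size")["Salary_USD"].median()
print(per_group)

# Transformation: one value per original row, ready to assign as a column
salaries["Median_Salary"] = salaries.groupby("Company_Size")["Salary_USD"].transform("median")
print(salaries)
```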

---

## 🎯 Let's Practice!  

✔ Try **converting different currencies**  
✔ Compute **more summary statistics**  
✔ **Visualize salary trends**  

🚀 **Numeric analysis is powerful—let’s explore more!**

![image.png](attachment:498f0891-27dd-4a8a-9a8f-f6b152aad75a.png)

In [None]:
# Preview the column
print(planes["Duration"].head())

# Remove the string character
planes["Duration"] = planes["Duration"].str.replace("h", "")

# Convert to float data type
planes["Duration"] = planes["Duration"].astype(float)

# Plot a histogram
sns.histplot(x="Duration", data=planes)
plt.show()

![image.png](attachment:2e68ce86-65e2-414b-962a-59be3ae69be1.png)

In [None]:
# Price standard deviation by Airline
planes["airline_price_st_dev"] = planes.groupby("Airline")["Price"].transform(lambda x: x.std())

print(planes[["Airline", "airline_price_st_dev"]].value_counts())

In [None]:
# Median Duration by Airline
planes["airline_median_duration"] = planes.groupby("Airline")["Duration"].transform(lambda x: x.median())

print(planes[["Airline","airline_median_duration"]].value_counts())

In [None]:
# Mean Price by Destination
planes["price_destination_mean"] = planes.groupby("Destination")["Price"].transform(lambda x: x.mean())

print(planes[["Destination","price_destination_mean"]].value_counts())

<a id='Handling_outliers'></a>
# Handling_outliers



## What is an Outlier?
An **outlier** is a data point that is significantly different from the rest of the dataset. For example, in a housing dataset with a median price of **$400,000**, a house priced at **$5 million** would likely be an outlier. However, other factors such as location, size, and number of bedrooms can influence whether a value is truly an outlier.

## Identifying Outliers Using Descriptive Statistics
A quick way to detect outliers is by using the `.describe()` method in Pandas. If the **maximum value** is significantly larger than the **mean** and **median**, this suggests extreme values.

## Interquartile Range (IQR) Method
Outliers can be mathematically defined using the **Interquartile Range (IQR)**:
- **IQR = 75th percentile - 25th percentile**
- Outliers are:
  - **Above**: `75th percentile + 1.5 * IQR`
  - **Below**: `25th percentile - 1.5 * IQR`

Box plots visualize the IQR, where **outliers** appear as points outside the whiskers.

## Calculating Outlier Thresholds
To calculate outlier thresholds:
```python
q75 = df["Salary_USD"].quantile(0.75)
q25 = df["Salary_USD"].quantile(0.25)
iqr = q75 - q25

upper_limit = q75 + 1.5 * iqr
lower_limit = q25 - 1.5 * iqr
print(lower_limit, upper_limit)
```
If the **lower limit** is negative, it can be ignored for datasets like salaries: values cannot be negative, so nothing can fall below it.

## Filtering Outliers
To find outliers:
```python
outliers = df[(df["Salary_USD"] > upper_limit) | (df["Salary_USD"] < lower_limit)]
```
To **remove outliers**:
```python
df_no_outliers = df[(df["Salary_USD"] >= lower_limit) & (df["Salary_USD"] <= upper_limit)]
```
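
As a worked illustration of the thresholds and filters, here is a sketch on a tiny made-up Series (the values are arbitrary and chosen so that one point is clearly extreme):

```python
import pandas as pd

# Hypothetical values with one obvious extreme point
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])

q25 = values.quantile(0.25)
q75 = values.quantile(0.75)
iqr = q75 - q25

upper_limit = q75 + 1.5 * iqr
lower_limit = q25 - 1.5 * iqr

# Points outside the limits are flagged; the rest are kept
outliers = values[(values > upper_limit) | (values < lower_limit)]
kept = values[(values >= lower_limit) & (values <= upper_limit)]

print(lower_limit, upper_limit)
print(outliers.tolist())
```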

## Impact of Removing Outliers
- **Mean salary decreases** after removing outliers.
- The **salary distribution** becomes less right-skewed and closer to **normal**, as seen in histograms before and after filtering.

## What to Do About Outliers?
- **Keep them** if they represent real, meaningful data (e.g., senior executives earning high salaries).
- **Remove them** if they are errors or distort analysis.
- **Transform data** (e.g., log transformation) to reduce skewness.
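
The transformation option can be sketched as follows. The salary figures are invented, and the natural log via `np.log()` is one common choice rather than the only one:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed salaries with one very high value
salaries_usd = pd.Series([30000, 40000, 45000, 50000, 60000, 400000])

# The log compresses large values much more than small ones
log_salaries = np.log(salaries_usd)

# Sample skewness before and after the transformation
raw_skew = salaries_usd.skew()
log_skew = log_salaries.skew()
print(raw_skew, log_skew)
```

The high value is still the largest after the transformation, but it pulls the distribution less, so no rows need to be discarded.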

## Next Steps
- Explore and visualize outliers using box plots and histograms.
- Consider the impact of outliers before making business or statistical decisions.


![image.png](attachment:993eda52-4f17-42c5-b15f-10afb01f9b0d.png)

![image.png](attachment:002fa4b6-f256-4d3e-aa43-b822b56d0498.png)

![image.png](attachment:9928cbc1-2a44-4458-9533-e2dcdfd78181.png)

In [None]:
# Plot a histogram of flight prices
sns.histplot(data=planes, x="Price")
plt.show()

# Display descriptive statistics for flight duration
print(planes.Duration.describe())

![image.png](attachment:a14e0017-16df-4b2b-bc31-51e675469fa2.png)

In [None]:
# Find the 75th and 25th percentiles
price_seventy_fifth = planes["Price"].quantile(0.75)
price_twenty_fifth = planes["Price"].quantile(0.25)

# Calculate iqr
prices_iqr = price_seventy_fifth - price_twenty_fifth

# Calculate the thresholds
upper = price_seventy_fifth + (1.5 * prices_iqr)
lower = price_twenty_fifth - (1.5 * prices_iqr)

# Subset the data
planes = planes[(planes["Price"] > lower) & (planes["Price"] < upper)]

print(planes["Price"].describe())

<a id='Patterns_over_time'></a>
# Patterns_over_time


## Importance of DateTime Data
When working with datasets that include **dates** or **time values**, it's useful to analyze trends over time. 

## Dataset Overview
We'll be using a dataset on **divorce filings in Mexico (2000-2015)**, which includes:
- `marriage_date`
- `marriage_duration` (in years)

## Importing DateTime Data
By default, pandas treats **date and time data** as **strings** when importing from a CSV file.  
To ensure correct interpretation, we use the `parse_dates` argument when reading the CSV:
```python
df = pd.read_csv("divorce_data.csv", parse_dates=["marriage_date"])
```
Now, pandas stores `marriage_date` with a **datetime dtype** (`datetime64`), enabling time-based analysis.

## Converting to DateTime Format After Import
If a dataset is already loaded but dates are stored as strings, we can convert them using:
```python
df["marriage_date"] = pd.to_datetime(df["marriage_date"])
```

## Creating DateTime Data
If dates are split into separate `year`, `month`, and `day` columns, we can combine them:
```python
df["marriage_date"] = pd.to_datetime(df[["year", "month", "day"]])
```
**Note:** Column names **must be "year", "month", and "day"** but can appear in any order.
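
A quick check of the column-order point, using made-up values. The columns are deliberately listed as day, month, year to show that only the names matter:

```python
import pandas as pd

# Hypothetical toy data with the date parts in a non-chronological column order
df = pd.DataFrame({
    "day": [15, 1],
    "month": [6, 12],
    "year": [2003, 2010],
})

# pandas assembles the date from the column names, not their positions
df["marriage_date"] = pd.to_datetime(df[["day", "month", "year"]])
print(df["marriage_date"])
```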

## Extracting Components from DateTime
Once a column is in DateTime format, we can extract specific components:
```python
df["marriage_month"] = df["marriage_date"].dt.month
df["marriage_year"] = df["marriage_date"].dt.year
```

## Visualizing Patterns Over Time
**Line plots** help visualize trends.  
Seaborn aggregates `y` values at each `x` and shows **mean estimates with confidence intervals**.
For example, to analyze marriage month vs. marriage duration:
```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.lineplot(x=df["marriage_month"], y=df["marriage_duration"])
plt.xlabel("Marriage Month")
plt.ylabel("Marriage Duration (Years)")
plt.title("Marriage Duration by Month")
plt.show()
```
- The **blue line** represents the mean marriage duration.
- The **lighter shaded area** indicates the **95% confidence interval**.
- **Wide confidence intervals** mean the estimated mean is uncertain at that point, usually because few observations or highly variable values fall there.
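
The mean that the line traces can be reproduced with a `groupby`, which also shows how many observations sit behind each point. This is a sketch on invented dates and durations; it does not compute the confidence interval itself:

```python
import pandas as pd

# Hypothetical toy data: two January marriages and three June marriages
df = pd.DataFrame({
    "marriage_date": pd.to_datetime([
        "2001-01-10", "2003-01-20", "2002-06-05", "2005-06-15", "2004-06-25",
    ]),
    "marriage_duration": [10, 6, 8, 4, 6],
})
df["marriage_month"] = df["marriage_date"].dt.month

# Mean duration per month, plus the number of rows behind each mean
monthly = df.groupby("marriage_month")["marriage_duration"].agg(["mean", "count"])
print(monthly)
```

Months with a low `count` are the ones where a wide interval would be expected.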

## Next Steps
- Practice working with DateTime data using a larger divorce dataset.
- Explore additional time-based trends such as **seasonal variations** or **yearly changes**.

![image.png](attachment:cd5d2620-4e0d-414d-8d32-996b2fe7a593.png)

In [None]:
# Import divorce.csv, parsing the appropriate columns as dates in the import
divorce = pd.read_csv('divorce.csv', parse_dates = ['divorce_date', 'dob_man', 'dob_woman', 'marriage_date'])
print(divorce.dtypes)

![image.png](attachment:87612c75-b76a-432e-ae5e-c36e6cc2c541.png)

In [None]:
# Convert the marriage_date column to DateTime values
divorce["marriage_date"] = pd.to_datetime(divorce["marriage_date"])

![image.png](attachment:c95940f2-6f49-4f14-9d19-791ea127a0c6.png)

In [None]:
# Define the marriage_year column
divorce["marriage_year"] = divorce["marriage_date"].dt.year

# Create a line plot showing the average number of kids by year
sns.lineplot(x='marriage_year', y='num_kids', data=divorce)
plt.show()

<a id='Correlation'></a>
# Correlation

## Understanding Correlation
Correlation helps us evaluate **relationships between variables**, guiding how data should be used. It describes:
- **Direction**: Positive or negative relationship.
- **Strength**: Weak or strong association.

## Calculating Correlation in Pandas
The `corr()` method in pandas computes the **Pearson correlation coefficient**, which measures the **linear** relationship between numeric variables:
```python
df.corr(numeric_only=True)  # recent pandas versions raise an error if text columns are included
```
- **Positive values**: Both variables tend to increase together.
- **Negative values**: One variable increases, the other decreases.
- **Values near 0**: Weak correlation.
- **Values near ±1**: Strong correlation.
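
A tiny example, with values invented so that the relationship is exactly linear, shows what a coefficient of -1 looks like and how `numeric_only=True` leaves text columns out:

```python
import pandas as pd

# Hypothetical toy data: duration falls by one year for every later marriage year
df = pd.DataFrame({
    "marriage_year": [2000, 2003, 2006, 2009, 2012],
    "marriage_duration": [14, 11, 8, 5, 2],
    "education": ["Primary", "Secondary", "Primary", "Professional", "Secondary"],
})

# Only the numeric columns appear in the correlation matrix
corr = df.corr(numeric_only=True)
print(corr)
```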

## Visualizing Correlation with Heatmaps
Heatmaps enhance interpretation by using **color coding**:
```python
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
```
- **Deep red**: Strong positive correlation (with the `coolwarm` palette used above).
- **Deep blue**: Strong negative correlation.
- **Example**: Marriage year and marriage duration show strong **negative correlation**.

## Correlation in Context
Interpreting correlation **requires understanding the dataset**.  
For instance:
- The dataset contains **only marriages that ended (2000-2015)**.
- This means **later marriages are shorter** by definition, leading to a **negative correlation**.

## Visualizing Relationships with Scatter Plots
The **Pearson coefficient** only measures **linear relationships**.  
Some relationships may be **non-linear**, which is why **scatter plots** help!

```python
sns.scatterplot(x=df["female_income"], y=df["male_income"])
plt.xlabel("Female Income")
plt.ylabel("Male Income")
plt.title("Scatter Plot of Partner Incomes")
plt.show()
```
- Example: The heatmap showed a correlation coefficient of **0.32** between **female** and **male** income.
- The scatter plot confirms this **weak positive relationship**.
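
The limitation of the Pearson coefficient can be shown with a small sketch on synthetic values (not from the dataset): `y` is completely determined by `x`, but the relationship is a curve, so the coefficient is essentially zero.

```python
import numpy as np

# Synthetic example: a perfect but non-linear (quadratic) relationship
x = np.linspace(-3, 3, 101)
y = x ** 2

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r: {r:.3f}")
```

A scatter plot of `x` against `y` would show a clear U shape, which is exactly the kind of pattern a correlation matrix alone would miss.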

## Pairplots for Multiple Variables
**Seaborn’s `pairplot()`** visualizes **all** pairwise relationships in one figure:
```python
sns.pairplot(df)
plt.show()
```
- The **diagonal** shows individual variable distributions.
- Can be overwhelming for datasets with **many variables**, since every pair gets its own panel.

To focus on **specific variables**, use:
```python
sns.pairplot(df, vars=["female_income", "male_income", "marriage_duration"])
plt.show()
```
- Confirms weak relationships between **income and marriage duration**.
- Reveals **shorter marriages are more common**.

## Next Steps
- Use scatter plots and heatmaps to explore relationships.
- Identify potential **non-linear** patterns.
- Investigate **causation vs correlation** in the dataset.

![image.png](attachment:58ad6462-e597-48cf-bdb5-a4f9a1d95379.png)

In [None]:
# Create the scatterplot
sns.scatterplot(x='marriage_duration', y='num_kids', data=divorce)
plt.show()

![image.png](attachment:9d149302-0774-4a83-b2f0-b8206cce8360.png)

In [None]:
# Create a pairplot for income_woman and marriage_duration
sns.pairplot(data=divorce, vars=['income_woman', 'marriage_duration'])
plt.show()

<a id='Factor_relationships_and_distributions'></a>
# Factor_relationships_and_distributions


## Exploring Categorical Variables
While we've analyzed relationships between **numerical variables**, categorical variables (**factors**) also hold valuable insights. Let's explore **education level** in our dataset.

### Education Level of Male Partners
Checking the distribution of education levels:
```python
df["education_man"].value_counts()
```
- Most men have **primary to professional** education.
- Some fall into the **"None" or "Other"** categories.

## Exploring Categorical Relationships
### Visualizing Marriage Duration by Education Level
Histograms can help us **compare distributions**.  
Let's examine the relationship between **marriage duration** and **male education level**:
```python
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
sns.histplot(df, x="marriage_duration", hue="education_man", multiple="stack")
plt.title("Marriage Duration by Male Education Level")
plt.show()
```
- This confirms that **many men in the dataset have professional education**.
- However, the **stacked bars make it hard to compare the shape of each group's distribution**.

## Kernel Density Estimate (KDE) Plots
**KDE plots** provide **smoother distributions**, making comparisons easier:
```python
sns.kdeplot(data=df, x="marriage_duration", hue="education_man", common_norm=False)
plt.title("Marriage Duration KDE by Male Education Level")
plt.show()
```
- Peaks for **different education levels** are clearer.
- But **smoothing algorithms** can create unrealistic values (e.g., **negative marriage durations!**).

### Fixing the KDE Plot
We prevent KDE plots from **extending beyond valid values** by setting `cut=0`:
```python
sns.kdeplot(data=df, x="marriage_duration", hue="education_man", common_norm=False, cut=0)
plt.title("Marriage Duration KDE (Fixed)")
plt.show()
```
- Now the plot correctly starts at **1 year**, the shortest marriage in our dataset.

## Cumulative KDE Plots
To visualize the **cumulative distribution function (CDF)**:
```python
sns.kdeplot(data=df, x="marriage_duration", hue="education_man", common_norm=False, cumulative=True)
plt.title("Cumulative Distribution of Marriage Duration by Education Level")
plt.show()
```
- This tells us **the probability that a marriage lasted up to a certain duration**.
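
To make the cumulative reading concrete, the same quantity can be computed directly as a proportion. A minimal sketch using a made-up list of durations (hypothetical values, not the divorce dataset):

```python
import pandas as pd

# Hypothetical marriage durations in years
durations = pd.Series([1, 2, 2, 5, 8, 10, 12, 20, 25, 30])

# Share of marriages that lasted 10 years or less (the empirical CDF at 10)
share_up_to_10 = (durations <= 10).mean()
print(share_up_to_10)
```

The cumulative KDE curve is a smoothed version of this proportion, evaluated at every duration on the x-axis.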

## Relationship Between Marriage Age and Education
Do **older couples** tend to have **higher education levels**?
First, create **approximate marriage age** columns:
```python
df["age_at_marriage_male"] = df["marriage_year"] - df["birth_year_male"]
df["age_at_marriage_female"] = df["marriage_year"] - df["birth_year_female"]
```
Then, plot a **scatter plot**:
```python
sns.scatterplot(x=df["age_at_marriage_male"], y=df["age_at_marriage_female"])
plt.xlabel("Male Age at Marriage")
plt.ylabel("Female Age at Marriage")
plt.title("Marriage Age Relationship")
plt.show()
```
- **Pearson correlation** = **0.69** → Strong positive correlation!

### Adding Education Level to the Scatter Plot
To introduce **education level**, use the `hue` argument:
```python
sns.scatterplot(x=df["age_at_marriage_male"], y=df["age_at_marriage_female"], hue=df["education_man"])
plt.xlabel("Male Age at Marriage")
plt.ylabel("Female Age at Marriage")
plt.title("Marriage Age vs. Education Level")
plt.show()
```
- **Orange dots** (men with professional education) suggest they **tend to marry later**.

## Next Steps
- **Practice exploring categorical relationships** with KDE and scatter plots.
- Consider **additional categorical variables** like **female education level**.
- Investigate **interaction effects** between education, income, and marriage duration.

![image.png](attachment:ced50ab4-9308-4c81-a974-2ff596687bc1.png)

In [None]:
# Create the scatter plot
sns.scatterplot(x='woman_age_marriage', y='income_woman', hue='education_woman', data=divorce)
plt.show()

![image.png](attachment:3952c7e1-4981-4810-9d74-d9ba4a5c2b08.png)

In [None]:
# Create the KDE plot
sns.kdeplot(x='marriage_duration', hue='num_kids', data=divorce)
plt.show()

In [None]:
# Update the KDE plot to show a cumulative distribution function
sns.kdeplot(data=divorce, x="marriage_duration", hue="num_kids", cut=0, cumulative=True)
plt.show()

<a id='Considerations_for_categorical_data'></a>
# Considerations_for_categorical_data



## Why Perform Exploratory Data Analysis (EDA)?
EDA helps us:
- **Detect patterns and relationships** in data.
- **Generate hypotheses** for further analysis.
- **Prepare data** for machine learning models.

## Representative Data
For any dataset to be useful, it must be **representative** of the population being studied.  
Example: If analyzing **education level vs. income in the USA**, we **must** collect data from **adults in the USA**, not France.

## Understanding Categorical Classes
Categorical variables have **labels or classes**.  
Example: **Marital Status**  
- Single  
- Married  
- Divorced  

## Class Imbalance
If a dataset contains **unbalanced classes**, it may lead to **biased results**.  
Example:  
| Marital Status | Count |
|---------------|-------|
| Married       | 50    |
| Divorced      | 700   |
| Single        | 250   |

Does this dataset accurately represent the population's view on marriage?  
- **Divorced people** may have a **negative bias** toward marriage.
- **Over-represented classes** can **skew model training and predictions**.

## Measuring Class Frequency
We can check how many occurrences each category has using `value_counts()`:
```python
df["marital_status"].value_counts()
```
For **relative frequencies** (proportions between 0 and 1), use:
```python
df["marital_status"].value_counts(normalize=True)
```
Example output (abbreviated):
```
Divorced    0.70
Single      0.25
Married     0.05
```
This suggests that **Married individuals are underrepresented**.
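
As a quick check, the counts from the table above can be rebuilt and converted to proportions. This is a sketch that constructs the series by hand; in practice the column would come from your own DataFrame.

```python
import pandas as pd

# Rebuild the example counts: 50 married, 700 divorced, 250 single
status = pd.Series(
    ["Married"] * 50 + ["Divorced"] * 700 + ["Single"] * 250,
    name="marital_status",
)

counts = status.value_counts()                # absolute counts
shares = status.value_counts(normalize=True)  # proportions that sum to 1
percent = (shares * 100).round(1)             # multiply by 100 for percentages

print(counts)
print(percent)
```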

## Cross-Tabulation (pd.crosstab)
Cross-tabulation helps us analyze relationships between **two categorical variables**.

Example: **Flights Dataset**
```python
import pandas as pd

pd.crosstab(df["Source"], df["Destination"])
```
- **Rows** = Departure Cities
- **Columns** = Destination Cities
- **Cell Values** = Count of flights on that route

### Finding the Most Popular Route
```python
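route_counts = pd.crosstab(df["Source"], df["Destination"])

# Flatten the table to one value per (Source, Destination) pair, then find the largest count
route_totals = route_counts.stack()
print(route_totals.idxmax(), route_totals.max())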
```
> **Most popular route**: **Delhi → Cochin (4,318 flights)**

## Extending Cross-Tabulation: Comparing Prices
Let's compare **actual vs. expected median flight prices**:

```python
pd.crosstab(df["Source"], df["Destination"], values=df["Price"], aggfunc="median")
```
- **Values Column**: `"Price"`
- **Aggregation Function**: `"median"`
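
Here is a small self-contained sketch of both forms of `pd.crosstab`. The six flights and their prices are invented for illustration and do not come from the flights dataset.

```python
import pandas as pd

# Hypothetical mini flights table
flights = pd.DataFrame({
    "Source": ["Delhi", "Delhi", "Delhi", "Kolkata", "Kolkata", "Mumbai"],
    "Destination": ["Cochin", "Cochin", "Cochin", "Bangalore", "Bangalore", "Hyderabad"],
    "Price": [9000, 11000, 10000, 7000, 9000, 4000],
})

# Counts of flights per route
route_counts = pd.crosstab(flights["Source"], flights["Destination"])
print(route_counts)

# Median price per route; routes with no flights show up as NaN
median_prices = pd.crosstab(
    flights["Source"],
    flights["Destination"],
    values=flights["Price"],
    aggfunc="median",
)
print(median_prices)
```

Note that `values` and `aggfunc` must be passed together: the first says what to aggregate, the second says how.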

### Sample vs. Population
Comparing **dataset prices** to **expected prices**, we notice:
- **Bangalore → Delhi** is **more expensive** in our dataset.
- This suggests our dataset might **not be representative** of real-world prices.

## Next Steps
- Check **class frequencies** in your dataset.
- Identify **imbalanced classes** and consider adjustments.
- Use **cross-tabulation** for better insights into categorical relationships.



![image.png](attachment:ba3a6803-edac-4c67-a8be-eddf4c0e7d09.png)

In [None]:
# Print the relative frequency of Job_Category
print(salaries['Job_Category'].value_counts(normalize=True))

![image.png](attachment:42779212-ad3b-41fa-9f89-b18c054c9c68.png)

In [None]:
# Cross-tabulate Company_Size and Experience
print(pd.crosstab(salaries["Company_Size"], salaries["Experience"]))

In [None]:
# Cross-tabulate Job_Category and Company_Size
print(pd.crosstab(salaries["Job_Category"], salaries["Company_Size"]))

In [None]:
# Cross-tabulate Job_Category and Company_Size, showing the mean salary for each combination
print(pd.crosstab(salaries["Job_Category"], salaries["Company_Size"],
            values=salaries["Salary_USD"], aggfunc="mean"))

<a id='Generating_new_features'></a>
# Generating_new_features

Sometimes, the way our data is formatted can limit our ability to detect relationships or impact the performance of machine learning models. One way to overcome these limitations is by generating new features from our existing data.

## Understanding Correlation  

A heatmap helps us visualize relationships between numeric variables. In our dataset, we see a moderate positive correlation between `Price` and `Duration`. However, there are very few numeric variables available.

## Converting Total Stops into a Numeric Feature  

Looking at the data types, `Total_Stops` should be numeric, but it is stored as a string. Checking `value_counts()`, we see values like `"2 stops"`, `"1 stop"`, and `"non-stop"`. We need to clean this data before converting it into an integer format:

1. Use `.str.replace()` to remove `" stops"` and `" stop"`.
2. Convert `"non-stop"` to `"0"`.
3. Change the column’s data type to integer.

After this transformation, `Total_Stops` is strongly correlated with `Duration`, and interestingly, it has a stronger correlation with `Price` than `Duration` does!
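
The three cleaning steps listed above could look roughly like this. It is a sketch on a hand-made column; the DataFrame name `planes` and the sample values are assumptions, while the string formats follow the ones described in the text.

```python
import pandas as pd

# Hypothetical sample of the raw Total_Stops column
planes = pd.DataFrame(
    {"Total_Stops": ["non-stop", "1 stop", "2 stops", "4 stops", "1 stop"]}
)

planes["Total_Stops"] = (
    planes["Total_Stops"]
    .str.replace(" stops", "", regex=False)     # "2 stops" becomes "2"
    .str.replace(" stop", "", regex=False)      # "1 stop" becomes "1"
    .str.replace("non-stop", "0", regex=False)  # "non-stop" becomes "0"
    .astype(int)                                # strings become integers
)

print(planes["Total_Stops"].tolist())
```

The order matters: removing `" stops"` before `" stop"` avoids leaving a stray `"s"` behind.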

## Extracting Date and Time Features  

Our dataset contains three datetime variables:  
- `Date_of_Journey`  
- `Dep_Time` (departure time)  
- `Arrival_Time`  

We can extract new features from these variables to uncover potential insights.

### Extracting Month and Weekday  
If we suspect that prices vary by month or day of the week, we can create new columns:  
- **Month:** Extracted from `Date_of_Journey`.  
- **Weekday:** Extracted using `.dt.weekday`, where `0` represents Monday and `6` represents Sunday.  

Previewing the data, we see that a flight on **September 6th** (a Friday) is correctly represented as `4`.
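
A minimal sketch of the month and weekday extraction follows. The text does not state the year, so the dates below assume 2019 (a year in which September 6th falls on a Friday), and the DataFrame name `planes` is hypothetical.

```python
import pandas as pd

# Hypothetical journey dates; the year 2019 is an assumption
planes = pd.DataFrame(
    {"Date_of_Journey": pd.to_datetime(["2019-09-06", "2019-03-24"])}
)

# Month as a number from 1 to 12
planes["month"] = planes["Date_of_Journey"].dt.month

# Weekday as a number from 0 (Monday) to 6 (Sunday)
planes["weekday"] = planes["Date_of_Journey"].dt.weekday

print(planes)
```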

### Extracting Departure and Arrival Hours  
People might pay more for flights at convenient times. We extract the hour from both `Dep_Time` and `Arrival_Time` using `.dt.hour`.
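
The hour extraction works the same way. In this sketch the timestamps are made up, and the new column names `Dep_Hour` and `Arrival_Hour` are assumptions rather than names from the source.

```python
import pandas as pd

# Hypothetical departure and arrival timestamps
planes = pd.DataFrame({
    "Dep_Time": pd.to_datetime(["2019-09-06 09:25", "2019-03-24 22:20"]),
    "Arrival_Time": pd.to_datetime(["2019-09-06 13:15", "2019-03-25 01:10"]),
})

# Hour of day as a number from 0 to 23
planes["Dep_Hour"] = planes["Dep_Time"].dt.hour
planes["Arrival_Hour"] = planes["Arrival_Time"].dt.hour

print(planes[["Dep_Hour", "Arrival_Hour"]])
```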

### Checking Correlation Again  
Since these new features are numeric, we check their correlation with `Price` and other variables. Unfortunately, no new strong relationships emerge, but without generating these features, we wouldn’t have known!

## Creating Categorical Features  

Another useful technique is **grouping numerical data into labeled categories**. For instance, we don’t have a column for **ticket type** (e.g., Economy, Business, First Class). We can create this using price ranges.

### Defining Price Categories  
To categorize ticket prices, we use quartiles:  
1. **25th percentile:** `df["Price"].quantile(0.25)`  
2. **50th percentile (median):** `df["Price"].median()`  
3. **75th percentile:** `df["Price"].quantile(0.75)`  
4. **Max price:** `df["Price"].max()`  

We create labels: `["Economy", "Premium Economy", "Business", "First Class"]`, then use `pd.cut()` to classify each price into these categories.
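
Putting the quartiles, labels, and `pd.cut()` together might look like the sketch below. The eight prices are invented, and using `0` as the lowest bin edge is an assumption so that the cheapest ticket falls inside the first bin.

```python
import pandas as pd

# Hypothetical ticket prices
planes = pd.DataFrame(
    {"Price": [3000, 4500, 5200, 6100, 7800, 8400, 9900, 12000]}
)

# Bin edges based on the quartiles of Price
twenty_fifth = planes["Price"].quantile(0.25)
median = planes["Price"].median()
seventy_fifth = planes["Price"].quantile(0.75)
maximum = planes["Price"].max()

labels = ["Economy", "Premium Economy", "Business", "First Class"]
bins = [0, twenty_fifth, median, seventy_fifth, maximum]  # five edges for four bins

# Each price is assigned to the bin whose range (left, right] contains it
planes["Price_Category"] = pd.cut(planes["Price"], bins=bins, labels=labels)

print(planes)
```

`pd.cut()` needs one more bin edge than labels, and by default each bin includes its right edge but not its left one.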

### Visualizing Price Categories by Airline  
Using `sns.countplot()`, we set the `hue` argument to our new **Price_Category** column. The results show that:  
- **Jet Airways** has the most "First Class" tickets.  
- **IndiGo** and **SpiceJet** primarily offer "Economy" tickets.  

## Time to Practice!  
Now it's your turn to apply these feature engineering techniques to your own dataset!

![image.png](attachment:53bfa0e9-a4f7-45a7-8c84-912d9a46b69d.png)

In [None]:
# Get the month of the response
salaries["month"] = salaries["date_of_response"].dt.month

# Extract the weekday of the response
salaries["weekday"] = salaries["date_of_response"].dt.weekday

# Create a heatmap
sns.heatmap(salaries.corr(numeric_only=True), annot=True)
plt.show()

![image.png](attachment:fc6c3351-7bef-499f-9911-1e67cd4cea5a.png)

In [None]:
# Find the 25th percentile
twenty_fifth = salaries["Salary_USD"].quantile(0.25)

# Save the median
salaries_median = salaries["Salary_USD"].median()

# Gather the 75th percentile
seventy_fifth = salaries["Salary_USD"].quantile(0.75)
print(twenty_fifth, salaries_median, seventy_fifth)

![image.png](attachment:ac2761fd-d36c-44c5-8bc8-e095be9863f4.png)

In [None]:
# Create salary labels
salary_labels = ["entry", "mid", "senior", "exec"]

# Create the salary ranges list
salary_ranges = [0, twenty_fifth, salaries_median, seventy_fifth, salaries["Salary_USD"].max()]

# Create salary_level
salaries["salary_level"] = pd.cut(salaries["Salary_USD"],
                                  bins=salary_ranges,
                                  labels=salary_labels)

# Plot the count of salary levels at companies of different sizes
sns.countplot(data=salaries, x="Company_Size", hue="salary_level")
plt.show()

<a id='Generating_hypotheses'></a>
# Generating_hypotheses


Generating hypotheses is a fundamental task for data scientists. Let’s explore how and when this is done!

## What Do We Know?  

At this point, we’ve explored our dataset extensively and even generated new features to uncover insights.  
- We know that many of **Jet Airways'** tickets are expensive, categorized as **First Class**.  
- **Duration, Total_Stops, and Price** are moderately correlated, but no other meaningful relationships exist.

## Spurious Correlation  

A scatter plot of **Price vs. Duration**, factoring in **Total_Stops**, suggests that **Total_Stops largely depends on Duration**.  
- This is an example of a **spurious correlation**: while we may think **Total_Stops** is driving **Price**, it is actually **Duration** behind the relationship!  
- Breaking it down further, we see that **zero-stop flights have a strong negative correlation with price**, but flights with **three or four stops** show no meaningful relationship.
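
One way to probe this kind of effect is sketched below on simulated data (not the flights dataset). Price is generated from duration only, and the number of stops is derived from duration. The raw correlation between stops and price is high, but it largely disappears once the linear effect of duration is removed. All numbers and the data-generating rule are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5_000

duration = rng.uniform(1, 30, size=n)        # flight duration in hours
stops = np.floor(duration / 8).astype(int)   # longer flights get more stops (0 to 3)
price = 1000 + 300 * duration + rng.normal(0, 1000, size=n)  # driven by duration only

sim = pd.DataFrame({"Duration": duration, "Total_Stops": stops, "Price": price})

# Raw correlation between stops and price
r_raw = sim["Total_Stops"].corr(sim["Price"])

# Remove the linear effect of duration from price, then correlate what is left with stops
slope, intercept = np.polyfit(sim["Duration"], sim["Price"], 1)
residual = sim["Price"] - (slope * sim["Duration"] + intercept)
r_adjusted = sim["Total_Stops"].corr(residual)

print(f"raw: {r_raw:.2f}, after accounting for duration: {r_adjusted:.2f}")
```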

## What Is True?  

During **Exploratory Data Analysis (EDA)**, a key question to ask is: **How do we know our observations are true?**  
- If we collected new flight data from a different time period, would we see the same patterns?  
- To draw **reliable conclusions**, we need **Hypothesis Testing**, a branch of statistics that allows us to test relationships and patterns.  

Before collecting data, we must:  
1. **Formulate a hypothesis or question.**  
2. **Specify the statistical test** that will determine whether the data support or contradict our hypothesis.  

## The Danger of Data Snooping  

Imagine we work for an agency regulating airlines, and we have constant access to flight data. It might seem tempting to **generate hypotheses after analyzing the data extensively**, but this can introduce **bias**:  
- We might unconsciously look for patterns that confirm what we already believe.  
- With large datasets, running multiple tests increases the likelihood of **finding a significant result by chance**.  
- This is known as **data snooping (or p-hacking)**: analyzing data excessively and conducting multiple statistical tests until a "significant" finding emerges.
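
The multiple-testing problem can be demonstrated with a short simulation sketch. Every column here is pure random noise, so any "significant" correlation is a false positive. The cutoff uses a large-sample approximation (under no true relationship, `r * sqrt(n)` is roughly standard normal), which is an assumption of this sketch rather than an exact test.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_features = 1000, 200

# A target and 200 candidate features, all unrelated random noise
target = rng.normal(size=n_rows)
noise = rng.normal(size=(n_rows, n_features))

# Pearson correlation of each noise feature with the target
r = np.array(
    [np.corrcoef(noise[:, j], target)[0, 1] for j in range(n_features)]
)

# Approximate two-sided 5% cutoff for |r| when there is no true relationship
threshold = 1.96 / np.sqrt(n_rows)
false_positives = int((np.abs(r) > threshold).sum())

print(f"{false_positives} of {n_features} noise features look 'significant'")
```

With a 5% threshold, roughly one in twenty unrelated features is expected to pass by chance, which is why testing many relationships and reporting only the winners is misleading.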

## How to Generate Hypotheses  

The right way to generate hypotheses is by performing **Exploratory Data Analysis (EDA)**. For example:  
- We might hypothesize that **Jet Airways flights last longer than SpiceJet flights**. A bar plot of **mean duration per airline** can help us assess this.  
- Or, we may suspect that **flights to New Delhi are more expensive** than other destinations. Again, plotting the data can help reveal trends.

## Next Steps  

Once we have a hypothesis, we need to design an experiment:  
1. **Choose a sample** and determine how much data we need.  
2. **Decide on an appropriate statistical test** to validate our hypothesis.  

Although designing experiments is outside the scope of this course, this gives us a sense of the role of EDA in data science!

## Let’s Practice!  

Now it’s time to generate some hypotheses from our own dataset!

![image.png](attachment:9ba04994-2bad-4d9b-b1f5-a566dad19d5f.png)

In [None]:
# Filter for employees in the US or GB
usa_and_gb = salaries[salaries["Employee_Location"].isin(["US", "GB"])]

# Create a barplot of salaries by location
sns.barplot(data=usa_and_gb, x="Employee_Location", y="Salary_USD")
plt.show()

![image.png](attachment:5c627083-144a-47dd-9d82-5bcf55f7a885.png)

In [None]:
# Create a bar plot of salary versus company size, factoring in employment status
sns.barplot(data=salaries, x="Company_Size", y="Salary_USD", hue="Employment_Status")
plt.show()

![image.png](attachment:04be84ff-2faf-4527-9007-127c64f730b0.png)

![image.png](attachment:6bc879f2-3583-41b8-829a-4963ed105928.png)