# Table of Contents
<li><a href="#Initial_exploration">Initial_exploration</a></li>
<li><a href="#Data_validation">Data_validation</a></li>
<li><a href="#Data_summarization">Data_summarization</a></li>
<li><a href="#Addressing_missing_data">Addressing_missing_data</a></li>
<li><a href="#Converting_and_analyzing_categorical_data">Converting_and_analyzing_categorical_data</a></li>
<li><a href="#Working_with_numeric_data">Working_with_numeric_data</a></li>
<li><a href="#Write_Here">Write_Here</a></li>

<a id='Initial_exploration'></a>
# Initial_exploration

Welcome to **EDA**! Before we apply machine learning or statistical analysis, we need to **understand our dataset**.  

---

## 1️⃣ What is Exploratory Data Analysis (EDA)?  
EDA is the process of **cleaning, summarizing, and visualizing data** to:  
✔ Identify **missing values** & **data types**.  
✔ Compute **descriptive statistics**.  
✔ Detect **patterns & correlations**.  
✔ Generate **hypotheses** for deeper analysis.  

📌 **Example:** Given a **books dataset**, we can explore:  
- **How many genres are represented?**  
- **What is the average book rating?**  
- **Are there any missing values?**  

---

## 2️⃣ Loading the Data  
Let's start by importing Pandas and loading a **CSV file**:  

```python
import pandas as pd  

# Load the dataset
books = pd.read_csv("books.csv")

# Display the first 5 rows
print(books.head())
```

👀 **What to look for?**  
✔ **Column names** (e.g., "title", "author", "genre", "rating").  
✔ **General structure** of the dataset.  

---

## 3️⃣ Checking Data Types & Missing Values  
The `.info()` method gives a quick overview:  

```python
books.info()  # .info() prints the summary itself and returns None
```

✔ **Column names & data types**  
✔ **Non-null values (missing data check)**  
✔ **Memory usage**  
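
As a small illustration, the sketch below builds a toy `books` DataFrame (the rows are made up for demonstration and are not taken from the real dataset) and derives the same facts that `.info()` reports: the data type and the non-null count of each column.

```python
import pandas as pd

# Hypothetical toy data standing in for books.csv
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "rating": [4.5, None, 3.9],
    "year": [2010, 2015, 2019],
})

dtypes = books.dtypes            # data type of each column
non_null = books.notna().sum()   # non-null count per column
missing = books.isna().sum()     # missing count per column

print(dtypes)
print(missing)
```

Here only `rating` has a missing entry, so its non-null count is one lower than the number of rows.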

---

## 4️⃣ Understanding Categorical Columns  
We often want to know **how many data points exist in each category**.  
For example, to check **genre distribution**:  

```python
print(books["genre"].value_counts())
```

📌 This helps us understand **which categories dominate the dataset**.  
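
Raw counts can be hard to compare across datasets of different sizes. A minimal sketch, using made-up genre labels, shows how `normalize=True` turns the counts into shares of the total:

```python
import pandas as pd

# Hypothetical genre labels for illustration
genres = pd.Series(["Fiction", "Non Fiction", "Fiction", "Fiction", "Childrens"])

counts = genres.value_counts()                 # absolute frequency, most common first
shares = genres.value_counts(normalize=True)   # relative frequency (sums to 1)

print(counts)
print(shares)
```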

---

## 5️⃣ Understanding Numerical Columns  
Use `.describe()` to get summary statistics:  

```python
print(books.describe())
```

🔹 **Key statistics provided:**  
✔ **Count** → Number of non-null values.  
✔ **Mean** → Average value.  
✔ **Standard Deviation** → Spread of the data.  
✔ **Min & Max** → Range of values.  
✔ **Quartiles (25%, 50%, 75%)** → Distribution insights.  
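
To see where these numbers come from, the sketch below runs `.describe()` on a tiny made-up `rating` column and checks a few entries against the individual pandas methods that compute them.

```python
import pandas as pd

ratings = pd.Series([3.0, 4.0, 4.0, 5.0], name="rating")  # made-up ratings

summary = ratings.describe()

# Each entry of the summary matches a dedicated method
same_mean = summary["mean"] == ratings.mean()
same_median = summary["50%"] == ratings.median()
same_q1 = summary["25%"] == ratings.quantile(0.25)

print(summary)
print(same_mean, same_median, same_q1)
```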

---

## 6️⃣ Visualizing Data with Histograms 📊  
Histograms help us understand **how numerical values are distributed**.  

```python
import seaborn as sns  
import matplotlib.pyplot as plt  

# Create a histogram of book ratings
sns.histplot(data=books, x="rating")

# Show the plot
plt.show()
```

👀 **What to look for?**  
✔ **Where is the data concentrated?**  
✔ **Are there outliers?**  

---

## 7️⃣ Adjusting Bin Width  
To **increase granularity**, set `binwidth=0.1`:  

```python
sns.histplot(data=books, x="rating", binwidth=0.1)
plt.show()
```

✅ **Smaller bins → Finer detail**, though very narrow bins can make the histogram look noisy.  

---

## 8️⃣ Let’s Practice! 🚀  
Now, try applying these techniques to **a new dataset**:  
✔ **Load the dataset**  
✔ **Check for missing values**  
✔ **Explore categorical & numerical columns**  
✔ **Visualize key distributions**  

EDA is the **first step** in making data-driven decisions—master it! 🎯  

![image.png](attachment:3b80c59d-8bf2-4188-823c-5bd340c2fc56.png)

In [None]:
# Print the first five rows of unemployment
print(unemployment.head())

In [None]:
# Print a summary of non-missing values and data types in the unemployment DataFrame
unemployment.info()

![image.png](attachment:d1135dde-9153-446d-b811-995b082e04ff.png)

In [None]:
# Count the values associated with each continent in unemployment
print(unemployment['continent'].value_counts())

![image.png](attachment:92276dd3-b295-411e-8ca2-a0b4436c8a1f.png)

In [None]:
# Import the required visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Create a histogram of 2021 unemployment; show a full percent in each bin
sns.histplot(x='2021', data=unemployment, binwidth=1)
plt.show()

<a id='Data_validation'></a>
# Data_validation

Before diving deep into analysis, we need to **validate our data** to ensure:  
✔ **Data types are correct**  
✔ **Categorical values match expected categories**  
✔ **Numerical values fall within expected ranges**  

---

## 1️⃣ Validating Data Types  

We use `.info()` to check **column types & missing values**:  

```python
books.info()  # .info() prints the summary itself and returns None
```

🔹 Alternatively, to check **only data types**:  

```python
print(books.dtypes)
```

---

## 2️⃣ Updating Data Types  

Sometimes, columns are stored in the wrong format. For example, **years should be integers**:  

```python
books["year"] = books["year"].astype(int)
print(books.dtypes)
```

📌 **Common data types & their Python names:**  
- `int` → Whole numbers  
- `float` → Decimal numbers  
- `str` → Text data  
- `bool` → True/False values  
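
A plain `.astype(int)` raises an error when a column contains text that is not a number or has missing values. One possible workaround, sketched here on a made-up column, is to coerce bad entries to `NaN` with `pd.to_numeric` and then use pandas' nullable `"Int64"` dtype:

```python
import pandas as pd

# Made-up column where years were read in as text and one entry is unusable
year_raw = pd.Series(["2001", "1999", "unknown"])

# errors="coerce" turns unparseable entries into NaN instead of raising
year_num = pd.to_numeric(year_raw, errors="coerce")

# "Int64" (capital I) is the nullable integer dtype, which can hold missing values
year_int = year_num.astype("Int64")

print(year_int)
```

Whether coercing is appropriate depends on the data: it silently hides bad entries, so it is worth counting the resulting missing values afterwards.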

---

## 3️⃣ Validating Categorical Data  

To check if values in a column match **expected categories**, use `.isin()`:  

```python
valid_genres = ["Fiction", "Non Fiction"]
print(books["genre"].isin(valid_genres))
```

🔍 This returns `True` or `False` for each row.  

🔹 To **invert** the results (show invalid values):  

```python
print(books[~books["genre"].isin(valid_genres)])
```

🔹 To **filter only valid rows**:  

```python
valid_books = books[books["genre"].isin(valid_genres)]
```
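
The sketch below applies the same idea to a made-up `books` frame and lists the distinct invalid labels, which is often more useful than a long column of `True`/`False`. Note that `.isin()` matches exactly, so `"fiction"` in lowercase counts as invalid.

```python
import pandas as pd

# Hypothetical data with one casing problem and one unexpected genre
books = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genre": ["Fiction", "Non Fiction", "fiction", "Childrens"],
})
valid_genres = ["Fiction", "Non Fiction"]

is_valid = books["genre"].isin(valid_genres)            # exact, case-sensitive match
invalid_values = books.loc[~is_valid, "genre"].unique() # distinct labels to investigate

print(invalid_values)
```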

---

## 4️⃣ Validating Numerical Data  

To select **only numerical columns**:  

```python
numerical_data = books.select_dtypes(include="number")
print(numerical_data.head())
```

📌 **Checking numerical ranges**:  

```python
print("Min Year:", books["year"].min())
print("Max Year:", books["year"].max())
```
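
Once the minimum and maximum look suspicious, `.between()` can pinpoint the offending rows. This is a minimal sketch on made-up years; the bounds `1900` and `2025` are an assumed plausible range, not a rule from the dataset.

```python
import pandas as pd

years = pd.Series([1998, 2015, 2031, 1450], name="year")  # made-up values

# Hypothetical plausible range; both endpoints are included by default
in_range = years.between(1900, 2025)
suspect = years[~in_range]

print(suspect)
```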

---

## 5️⃣ Visualizing Numerical Data  

Use a **boxplot** to detect **outliers & distributions**:  

```python
import seaborn as sns  
import matplotlib.pyplot as plt  

sns.boxplot(data=books, x="year")
plt.show()
```

📊 **Interpreting the Boxplot:**  
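✔ **Whiskers** → The smallest & largest values that are not flagged as outliers; more extreme points are drawn individually  
✔ **Box edges (25th & 75th percentile)** → The box spans the middle 50% of data  
✔ **Line inside the box (median)** → The middle value  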
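
By default, seaborn extends the whiskers to the most extreme points within 1.5 × IQR of the box and plots anything beyond that as an outlier. The sketch below reproduces that rule numerically on made-up values, so the flagged points can be inspected without a plot.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])  # made-up data with one extreme value

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1                      # interquartile range: the height of the box

lower_fence = q1 - 1.5 * iqr       # points below this are treated as outliers
upper_fence = q3 + 1.5 * iqr       # points above this are treated as outliers

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, q3, iqr, outliers.tolist())
```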

🔹 **Grouping by categories**:  

```python
sns.boxplot(data=books, x="genre", y="year")
plt.show()
```

---

## 🎯 Let’s Practice!  

Now, apply these validation steps to **a new dataset**:  
✔ **Check & update data types**  
✔ **Validate categorical values**  
✔ **Check numerical ranges**  
✔ **Use visualizations to spot issues**  

Ensuring **clean data** leads to **more reliable insights**! 🚀  

![image.png](attachment:f6f88571-af20-4b67-aaf2-52cd2006c969.png)

In [None]:
# Update the data type of the 2019 column to a float
unemployment["2019"] = unemployment["2019"].astype(float)
# Print the dtypes to check your work
print(unemployment.dtypes)

![image.png](attachment:e705fd97-c6e7-413f-ad28-bdcccff80a3f.png)

In [None]:
# Define a Series describing whether each continent is outside of Oceania
not_oceania = ~unemployment["continent"].isin(["Oceania"])

# Print unemployment without records related to countries in Oceania
print(unemployment[not_oceania])

![image.png](attachment:a8f90c90-f076-4d84-98a8-006b1c3c2a80.png)

In [None]:
# Print the minimum and maximum unemployment rates during 2021
print(unemployment['2021'].min(), unemployment['2021'].max())

# Create a boxplot of 2021 unemployment rates, broken down by continent
sns.boxplot(x='2021', y='continent', data=unemployment)
plt.show()

<a id='Data_summarization'></a>
# Data_summarization


After validating our data, it's time to **summarize key insights** by:  
✔ **Grouping data** to compare categories  
✔ **Applying aggregations** for numerical summaries  
✔ **Visualizing categorical comparisons**  

---

## 1️⃣ Grouping Data with `.groupby()`  

We use `.groupby()` to summarize numerical data within each **category**:  

```python
books.groupby("genre").mean(numeric_only=True)  # skip text columns such as title
```

🔍 This computes the **average** for all numerical columns within each genre.  

🔹 **Other aggregation functions:**  
- `.sum()` → Total sum  
- `.count()` → Number of values  
- `.min()` / `.max()` → Min & max values  
- `.var()` → Variance  
- `.std()` → Standard deviation  
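
A short sketch on a made-up `books` frame shows a few of these aggregations side by side. Selecting a column after `.groupby()` restricts the calculation to that column.

```python
import pandas as pd

# Hypothetical toy data
books = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction"],
    "rating": [4.0, 5.0, 3.0, 4.0],
    "year": [2010, 2012, 2015, 2019],
})

by_genre = books.groupby("genre")

mean_rating = by_genre["rating"].mean()                        # average per genre
n_books = by_genre["rating"].count()                           # non-null values per genre
year_range = by_genre["year"].max() - by_genre["year"].min()   # span of years per genre

print(mean_rating)
print(year_range)
```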

---

## 2️⃣ Aggregating Ungrouped Data  

To compute **multiple summary statistics** at once:  

```python
books.select_dtypes(include="number").agg(["mean", "std"])
```

📌 **`mean` and `std` are only defined for numeric columns**, so non-numeric columns must be excluded before aggregating.  

---

## 3️⃣ Specifying Aggregations for Each Column  

We can **control which aggregations** apply to **which columns**:  

```python
books.agg({"rating": ["mean", "std"], "year": "median"})
```

---

## 4️⃣ Named Summary Columns  

To apply multiple aggregations **within groups** and give **custom column names**:  

```python
books.groupby("genre").agg(
    rating_mean=("rating", "mean"),
    rating_std=("rating", "std"),
    median_year=("year", "median")
)
```

🔹 **Output Example:**  

| Genre       | Rating Mean | Rating Std | Median Year |
|------------|-------------|-------------|-------------|
| Fiction     | 4.2         | 0.3         | 2012        |
| Non-Fiction | 4.5         | 0.2         | 2015        |

---

## 5️⃣ Visualizing Categorical Summaries  

A **Seaborn barplot** quickly shows the **mean rating by genre**:  

```python
import seaborn as sns
import matplotlib.pyplot as plt  

sns.barplot(data=books, x="genre", y="rating")
plt.show()
```

📊 **Interpreting the Barplot:**  
✔ **Bar height** → Average rating  
✔ **Vertical line on top** → 95% confidence interval  
✔ **Longer error bars** → More uncertainty about that genre's mean rating (fewer books or more spread-out ratings)  
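
Seaborn's default error bars come from bootstrapping: the data is resampled many times and the spread of the resampled means gives the interval. The following is a rough sketch of that idea with NumPy on made-up ratings; it is not seaborn's internal code, and the seed and number of resamples are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
ratings = np.array([4.1, 4.5, 3.9, 4.8, 4.2, 4.4, 4.0, 4.6])  # made-up ratings

# Resample with replacement many times and record the mean of each resample
boot_means = np.array([
    rng.choice(ratings, size=ratings.size, replace=True).mean()
    for _ in range(1000)
])

# The middle 95% of the bootstrap means approximates the confidence interval
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(ratings.mean(), ci_low, ci_high)
```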

---

## 🎯 Let’s Practice!  

Now, apply these summarization techniques to **a new dataset**:  
✔ **Group & aggregate data**  
✔ **Customize aggregations per column**  
✔ **Use visualizations for better insights**  

Summarizing data **helps uncover trends** and **guides further analysis**! 🚀  

![image.png](attachment:5f556853-0eab-44ab-82b1-5a4306fffaae.png)

In [None]:
# Print the mean and standard deviation of rates by year
print(unemployment.select_dtypes(include="number").agg(['mean', 'std']))

In [None]:
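# Print yearly mean and standard deviation grouped by continent
# (restricted to numeric columns so text columns do not raise an error)
numeric_cols = unemployment.select_dtypes(include="number").columns
print(unemployment.groupby('continent')[numeric_cols].agg(['mean', 'std']))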

![image.png](attachment:5c9a160c-d0c3-48e6-862f-d8b2f47dffff.png)

In [None]:
continent_summary = unemployment.groupby("continent").agg(
    # Create the mean_rate_2021 column
    mean_rate_2021=('2021', 'mean'),
    # Create the std_rate_2021 column
    std_rate_2021=('2021', 'std')
)
print(continent_summary)

![image.png](attachment:1b2c8276-3354-4d5e-af74-1eedb7b55a87.png)

In [None]:
# Create a bar plot of continents and their average unemployment
sns.barplot(x='continent', y='2021', data=unemployment)
plt.show()

<a id='Addressing_missing_data'></a>
# Addressing_missing_data

Missing data can **skew analysis** and lead to **misleading insights**.  
This guide covers:  
✔ **Detecting missing values**  
✔ **Strategies to handle missing data**  
✔ **Using Pandas functions for imputation & filtering**  

---

## 1️⃣ Why is Missing Data a Problem?  

Missing values can affect:  
✔ **Distributions** → Biased means & incorrect conclusions  
✔ **Correlations** → Distorted relationships between variables  
✔ **Predictive models** → Reduces accuracy  

Example: If **tallest students' heights** are missing, the **average height** will be **lower than reality**.  
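
The height example can be checked with a few made-up numbers. The sketch removes the two tallest measurements and compares the mean before and after.

```python
import numpy as np
import pandas as pd

heights = pd.Series([150.0, 160.0, 170.0, 180.0, 190.0])  # made-up heights in cm
true_mean = heights.mean()

# Simulate the two tallest measurements going missing
observed = heights.copy()
observed[heights >= 180] = np.nan

observed_mean = observed.mean()  # pandas skips NaN by default
print(true_mean, observed_mean)
```

The observed mean drops from 170 to 160 because the missing values are not spread evenly across the distribution.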

---

## 2️⃣ Checking for Missing Values  

To **count missing values** per column:  

```python
salaries.isna().sum()
```

🔍 **Example Output:**

| Column         | Missing Values |
|---------------|----------------|
| Salary_USD   | 60             |
| Job Title    | 10             |

---

## 3️⃣ Strategies for Addressing Missing Data  

✅ **Rule of Thumb:**  
✔ **Drop** missing rows if **≤5% of total data**  
✔ **Impute** missing values if **>5%** using:  
   - **Mean** (for normally distributed data)  
   - **Median** (for skewed data)  
   - **Mode** (for categorical data)  
✔ **Group-based Imputation** → Fill missing values using **median per group**  
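
The rule of thumb can be turned into a quick triage step. This sketch uses a made-up 100-row frame and sorts columns into "drop the affected rows" and "impute" based on their share of missing values; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up frame: 100 rows, "a" is missing 2 values, "b" is missing 20, "c" is complete
df = pd.DataFrame({
    "a": [np.nan] * 2 + [1.0] * 98,
    "b": [np.nan] * 20 + [2.0] * 80,
    "c": [3.0] * 100,
})

missing_share = df.isna().mean()  # fraction of missing values per column

# Columns with a small share of gaps: drop the rows where they are missing
drop_rows_for = missing_share[(missing_share > 0) & (missing_share <= 0.05)].index.tolist()

# Columns with a larger share of gaps: candidates for imputation
impute = missing_share[missing_share > 0.05].index.tolist()

print(drop_rows_for, impute)
```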

---

## 4️⃣ Dropping Missing Values  

If a column has missing values in **≤5% of rows**, we **drop those rows**:

```python
threshold = len(salaries) * 0.05  # 5% threshold
cols_to_drop = salaries.isna().sum()[salaries.isna().sum() <= threshold].index
salaries.dropna(subset=cols_to_drop, inplace=True)
```

---

## 5️⃣ Imputing Categorical Data  

For categorical columns, we **replace missing values with mode**:

```python
for col in ["Job Title", "Employment Type", "Company Size"]:
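    # Assign the result back; calling fillna(..., inplace=True) on a selected column is unreliable in recent pandas
    salaries[col] = salaries[col].fillna(salaries[col].mode()[0])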
```

---

## 6️⃣ Imputing Numerical Data by Group  

When **salary varies by experience level**, we impute **median salary per experience group**:

```python
median_salaries = salaries.groupby("Experience")["Salary_USD"].median().to_dict()
salaries["Salary_USD"] = salaries["Salary_USD"].fillna(salaries["Experience"].map(median_salaries))
```

📌 **Why does this work?**  
- Groups salaries **by experience level**  
- Computes **median salary for each group**  
- Uses `.map()` to **fill missing values accordingly**  
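
An equivalent approach skips the dictionary and uses `.transform("median")`, which returns the group median aligned to every row. The sketch below tries it on a made-up `salaries` frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in each experience group
salaries = pd.DataFrame({
    "Experience": ["Entry", "Entry", "Entry", "Senior", "Senior", "Senior"],
    "Salary_USD": [50.0, 60.0, np.nan, 120.0, np.nan, 140.0],
})

# One value per row: the median salary of that row's experience group
group_median = salaries.groupby("Experience")["Salary_USD"].transform("median")

# Only the missing entries are replaced; existing salaries are kept
salaries["Salary_USD"] = salaries["Salary_USD"].fillna(group_median)
print(salaries)
```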

---

## 7️⃣ Final Check: No More Missing Values!  

To verify:  

```python
salaries.isna().sum()
```

---

## 🎯 Let's Practice!  

Try these techniques on a new dataset:  
✔ Identify missing values  
✔ Decide whether to **drop or impute**  
✔ Apply different **imputation strategies**  

Handling missing data **ensures reliable analysis & accurate models!** 🚀  


![image.png](attachment:5a12f7b7-444d-4fd9-8795-d42449b400a5.png)

In [None]:
# Count the number of missing values in each column
print(planes.isna().sum())

# Find the five percent threshold
threshold = len(planes) * 0.05

# Create a filter
cols_to_drop = planes.columns[planes.isna().sum() <= threshold]

# Drop missing values for columns below the threshold
planes.dropna(subset=cols_to_drop, inplace=True)

print(planes.isna().sum())

![image.png](attachment:feefe9e0-9475-4537-8365-0e7d9db1ae6c.png)

In [None]:
# Check the values of the Additional_Info column
print(planes["Additional_Info"].value_counts())

# Create a box plot of Price by Airline
sns.boxplot(data=planes, x='Airline', y='Price')

plt.show()

![image.png](attachment:12f9421a-bf36-4ccb-86bb-7b5fa3aedfa6.png)

In [None]:
# Calculate median plane ticket prices by Airline
airline_prices = planes.groupby("Airline")["Price"].median()

print(airline_prices)

# Convert to a dictionary
prices_dict = airline_prices.to_dict()

# Map the dictionary to missing values of Price by Airline
print(planes["Airline"].map(prices_dict))
planes["Price"] = planes["Price"].fillna(planes["Airline"].map(prices_dict))

# Check for missing values
print(planes.isna().sum())

<a id='Converting_and_analyzing_categorical_data'></a>
# Converting_and_analyzing_categorical_data

Categorical data, such as **job titles, experience levels, and company sizes**, can be analyzed by:  
✔ **Counting unique values**  
✔ **Extracting patterns from text**  
✔ **Creating new categorical columns**  
✔ **Visualizing category distributions**  

---

## 1️⃣ Previewing Categorical Data  

We can **select only the non-numeric columns** using `.select_dtypes(exclude="number")`, then preview with `.head()`:  

```python
salaries.select_dtypes(exclude="number").head()
```

Example output (first 5 rows):

| Designation        | Experience | Employment_Status | Company_Size |
|--------------------|------------|-------------------|--------------|
| Data Scientist    | Senior     | Full-Time        | Large        |
| ML Engineer       | Mid-Level  | Contract        | Medium       |

---

## 2️⃣ Counting Unique Values in a Column  

To check the **number of unique job titles**:  

```python
salaries["Designation"].nunique()
```

🔍 Example output: `50`, meaning there are 50 unique job titles.  

---

## 3️⃣ Extracting Value from Categories  

We can filter job titles containing **specific keywords** using `.str.contains()`:  

```python
salaries["Designation"].str.contains("Scientist")
```

🔍 Returns **True/False** values for rows where "Scientist" is present.  

---

## 4️⃣ Searching for Multiple Keywords  

We can **search for multiple phrases** using a **pipe (`|`) operator**:  

```python
salaries["Designation"].str.contains("Machine Learning|AI")
```

📌 **Important:** No spaces around `|` → `"ML | AI"` **won't work** correctly!  

To filter job titles **starting with** a word like "Data", use `^`:  

```python
salaries["Designation"].str.contains("^Data")
```

✅ Matches **"Data Scientist"**  
❌ Does not match **"Big Data Engineer"**  
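
The patterns above can be compared on a handful of made-up job titles. The last pattern shows why spaces around `|` matter: they become part of what has to match.

```python
import pandas as pd

# Hypothetical job titles
titles = pd.Series(["Data Scientist", "Big Data Engineer", "AI Researcher", "ML Engineer"])

has_data = titles.str.contains("Data")                  # "Data" anywhere in the title
starts_data = titles.str.contains("^Data")              # "Data" only at the start
ml_or_ai = titles.str.contains("Machine Learning|AI")   # either phrase
spaced = titles.str.contains("ML | AI")                 # matches "ML " or " AI", spaces included

print(pd.DataFrame({
    "title": titles, "has_data": has_data, "starts_data": starts_data,
    "ml_or_ai": ml_or_ai, "spaced": spaced,
}))
```

With the spaced pattern, "AI Researcher" is missed because there is no space in front of "AI".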

---

## 5️⃣ Creating a List of Job Categories  

We define **job categories** based on **common keywords**:  

```python
data_science = "Data Scientist|NLP"
data_analyst = "Analyst|Analytics"
data_engineer = "Data Engineer"
ml_engineer = "Machine Learning Engineer"
managerial = "Manager|Lead"
consultant = "Consultant"
```

We store these **filters in a list**:  

```python
job_categories = [
    data_science, data_analyst, data_engineer,
    ml_engineer, managerial, consultant
]
```

---

## 6️⃣ Creating a Categorical Column  

We use **NumPy’s `select()` function** to classify job roles:  

```python
import numpy as np

conditions = [
    salaries["Designation"].str.contains(data_science),
    salaries["Designation"].str.contains(data_analyst),
    salaries["Designation"].str.contains(data_engineer),
    salaries["Designation"].str.contains(ml_engineer),
    salaries["Designation"].str.contains(managerial),
    salaries["Designation"].str.contains(consultant)
]

categories = ["Data Science", "Data Analyst", "Data Engineer", 
              "ML Engineer", "Managerial", "Consultant"]

salaries["Job_Category"] = np.select(conditions, categories, default="Other")
```

📌 **Why use `np.select()`?**  
✔ Matches job titles to categories **efficiently**  
✔ Assigns **"Other"** if no match is found  
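
One detail worth knowing is that `np.select()` uses the **first** condition that is true for each row, so the order of the conditions matters. The sketch below shows this on made-up titles; `na=False` is added as a precaution so missing titles count as "no match".

```python
import numpy as np
import pandas as pd

# Hypothetical job titles; the second one matches two patterns
titles = pd.Series(["Data Scientist", "Lead Data Scientist", "Data Engineer", "Designer"])

conditions = [
    titles.str.contains("Data Scientist", na=False),
    titles.str.contains("Manager|Lead", na=False),
    titles.str.contains("Data Engineer", na=False),
]
choices = ["Data Science", "Managerial", "Data Engineer"]

# The first matching condition wins; rows with no match get the default
category = np.select(conditions, choices, default="Other")
print(category)
```

"Lead Data Scientist" ends up in "Data Science" rather than "Managerial" because that condition is listed first.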

---

## 7️⃣ Previewing the New Column  

To **check correctness**:  

```python
salaries[["Designation", "Job_Category"]].head()
```

Example output:

| Designation         | Job_Category    |
|--------------------|----------------|
| Data Scientist    | Data Science    |
| Machine Learning Engineer | ML Engineer |
| Big Data Engineer | Data Engineer   |

✅ **Looks good!**  

---

## 8️⃣ Visualizing Job Category Distribution  

To **plot category frequency** using Seaborn:  

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(data=salaries, x="Job_Category")
plt.xticks(rotation=45)
plt.show()
```

🔍 **Insights:**  
- Most jobs fall under **Data Science, Engineer, and Analyst** roles  
- Very few jobs were classified as **Other**, meaning **high categorization accuracy**  

---

## 🎯 Let's Practice!  

✔ Try **extracting job roles** from a **new dataset**  
✔ **Define** your own categories  
✔ **Visualize** the frequency of categories  

This **enhances data quality** and **enables better analysis!** 🚀  

![image.png](attachment:8437aa3b-3772-4f16-8cfb-e351fa68c99e.png)

In [None]:
# Filter the DataFrame for object columns
non_numeric = planes.select_dtypes("object")

# Loop through columns
for col in non_numeric.columns:
  
  # Print the number of unique values
  print(f"Number of unique values in {col} column: ", non_numeric[col].nunique())

![image.png](attachment:c55c9284-94f7-465a-9db4-a7686cf713dd.png)

In [None]:
# Create a list of categories
flight_categories = ["Short-haul", "Medium", "Long-haul"]

# Create short_flights
short_flights = "^0h|^1h|^2h|^3h|^4h"

# Create medium_flights
medium_flights = "^5h|^6h|^7h|^8h|^9h"

# Create long_flights
long_flights = "10h|11h|12h|13h|14h|15h|16h"

![image.png](attachment:b8516691-d8c0-417e-b395-3fb931c6fa9d.png)

In [None]:
# NumPy provides select(); import it here in case it is not already loaded
import numpy as np

# Create conditions for values in flight_categories to be created
conditions = [
    (planes["Duration"].str.contains(short_flights)),
    (planes["Duration"].str.contains(medium_flights)),
    (planes["Duration"].str.contains(long_flights))
]

# Apply the conditions list to the flight_categories
planes["Duration_Category"] = np.select(conditions, 
                                        flight_categories,
                                        default="Extreme duration")

# Plot the counts of each category
sns.countplot(data=planes, x="Duration_Category")
plt.show()

<a id='Working_with_numeric_data'></a>
# Working_with_numeric_data


Numeric data allows us to:  
✔ **Clean & convert values**  
✔ **Perform currency conversions**  
✔ **Calculate summary statistics**  
✔ **Add insights to the DataFrame**  

---

## 1️⃣ Loading the Salaries Dataset  

Let's print a **summary of the dataset**:  

```python
salaries.info()
```

🔍 **Observation:**  
- No `Salary_USD` column  
- Instead, there's a `Salary_In_Rupees` column  

---

## 2️⃣ Cleaning Salary Data  

### 🔹 Issue: Salary values contain **commas** & are stored as **strings**  
### 🔹 Solution: Remove commas & convert to **float**  

```python
salaries["Salary_In_Rupees"] = salaries["Salary_In_Rupees"].str.replace(",", "").astype(float)
```

✅ **No more commas!**  

---

## 3️⃣ Converting Rupees to USD  

We apply the **conversion rate**:  

```python
conversion_rate = 0.012  # 1 INR = 0.012 USD
salaries["Salary_USD"] = salaries["Salary_In_Rupees"] * conversion_rate
```

### 🔍 Checking Conversion  
```python
salaries[["Salary_In_Rupees", "Salary_USD"]].head()
```

📌 **Salary in USD** is **1.2% of INR salary**  
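
Both steps can be tried end to end on a couple of made-up salary strings. The conversion rate is the same illustrative figure used above; real exchange rates change over time.

```python
import pandas as pd

# Made-up salaries stored as text with comma separators
salaries = pd.DataFrame({"Salary_In_Rupees": ["20,00,000", "7,50,000"]})

# Strip the separators (plain text replacement, not a regex), then convert to numbers
rupees = salaries["Salary_In_Rupees"].str.replace(",", "", regex=False).astype(float)

conversion_rate = 0.012  # illustrative rate: 1 INR = 0.012 USD
salaries["Salary_USD"] = rupees * conversion_rate

print(salaries)
```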

---

## 4️⃣ Calculating Mean Salary by Company Size  

Using `groupby()` to **get mean salary per company size**:  

```python
salaries.groupby("Company_Size")["Salary_USD"].mean()
```

---

## 5️⃣ Adding Summary Statistics  

### Standard Deviation of Salary **by Experience**  

```python
salaries["Salary_StdDev"] = salaries.groupby("Experience")["Salary_USD"].transform("std")
```

### 🔍 Checking Results  
```python
salaries[["Experience", "Salary_StdDev"]].value_counts()
```

📌 **Senior-level (SE) employees have the highest salary variation!**  

---

## 6️⃣ Adding Median Salary by Company Size  

```python
salaries["Median_Salary"] = salaries.groupby("Company_Size")["Salary_USD"].transform("median")
```

✅ **Medium-sized companies have the highest median salaries!**  
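
The reason for `.transform()` rather than a plain aggregation is the shape of the result. A minimal sketch on made-up data compares the two: aggregation returns one row per group, while `.transform()` returns one value per original row, so it can be stored as a new column.

```python
import pandas as pd

# Hypothetical toy data
salaries = pd.DataFrame({
    "Company_Size": ["S", "S", "M", "M", "M"],
    "Salary_USD": [40.0, 60.0, 70.0, 90.0, 110.0],
})

grouped = salaries.groupby("Company_Size")["Salary_USD"]

per_group = grouped.median()            # one row per company size
per_row = grouped.transform("median")   # one value per original row

salaries["Median_Salary"] = per_row
print(per_group)
print(salaries)
```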

---

## 🎯 Let's Practice!  

✔ Try **converting different currencies**  
✔ Compute **more summary statistics**  
✔ **Visualize salary trends**  

🚀 **Numeric analysis is powerful—let’s explore more!**

In [None]:
<a id='Refer_to'></a>
# Refer_to
