# Session 4 🐍

☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️

***

# 24. Exploratory Data Analysis (EDA) 
**EDA** is the first step in a data science project: you investigate, visualize, and summarize a dataset to understand its main characteristics, patterns, and potential issues before applying formal modeling techniques.

***

# 25. Python's Role in EDA
Python has become the dominant language for EDA thanks to its rich ecosystem of data analysis libraries and its readable syntax. The **core libraries** are:
- **Pandas** - Data manipulation and analysis, providing DataFrames for structured data handling
- **NumPy** - Numerical computing and array operations
- **Matplotlib** - Basic plotting and visualization
- **Seaborn** - Statistical visualizations with attractive defaults
- **Plotly** - Interactive visualizations
- **SciPy** - Statistical functions and tests

***

# 26. Pandas 
Pandas is the most widely used Python library for data manipulation and analysis. It provides DataFrame and Series structures to handle structured data efficiently. 
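
To make the relationship between the two structures concrete, here is a minimal sketch. The variable names (`ages`, `people`, `age_column`) are illustrative only; the sample values mirror the ones used later in this session.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
ages = pd.Series([25, 30, 35], name="Age")

# A DataFrame is a table whose columns are Series that share one index.
people = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": ages})

age_column = people["Age"]          # selecting a single column gives back a Series
print(type(age_column).__name__)
print(people.shape)                 # (rows, columns)
```

Selecting one column returns a Series, while the table as a whole is a DataFrame.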

***

# 27. Creating DataFrames

## 27-1. From a Dictionary

In [1]:
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "London", "Paris"]
}

df = pd.DataFrame(data)
print(df)

      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Paris


***

## 27-2. From a CSV File

In [2]:
import pandas as pd
df = pd.read_csv("data.csv")  
df.head()  # Displays first 5 rows

Unnamed: 0,title,author,year,art_type
0,The Starry Night,Vincent van Gogh,1889,painting
1,Mona Lisa,Leonardo da Vinci,1503,painting
2,David,Michelangelo,1504,sculpture
3,The Thinker,Auguste Rodin,1902,sculpture
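
The `Unnamed: 0` column in the output above usually means the file was saved with its index included. One possible way to handle it is the `index_col` parameter of `pd.read_csv`. The sketch below uses an in-memory string (`csv_text`, a hypothetical stand-in for `data.csv`) so that it runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical stand-in for data.csv: the first column has no header,
# which is what a CSV written with its index looks like.
csv_text = """,title,author,year,art_type
0,The Starry Night,Vincent van Gogh,1889,painting
1,Mona Lisa,Leonardo da Vinci,1503,painting
2,David,Michelangelo,1504,sculpture
3,The Thinker,Auguste Rodin,1902,sculpture
"""

plain = pd.read_csv(io.StringIO(csv_text))                 # keeps an "Unnamed: 0" column
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)  # uses that column as the index

print(list(plain.columns))
print(list(indexed.columns))
```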


***

## 27-3. From Excel

In [None]:
import pandas as pd
df = pd.read_excel("data.xlsx", sheet_name="SalesOrders")
print(df.head())

***

# 28. Basic DataFrame Operations

***

## 28-1. Viewing Data

In [None]:
df.head()      # First 5 rows
df.tail(3)     # Last 3 rows
df.sample(2)   # Random 2 rows
df.shape       # (rows, columns)
df.columns     # List of column names
df.info()      # Data types & memory usage
df.describe()  # Summary statistics for numeric columns

***

## 28-2. Selecting Columns

In [None]:
df["Name"]           # Single column (returns Series)
df[["Name", "Age"]]  # Multiple columns (returns DataFrame)

***

## 28-3. Filtering Rows

In [None]:
df[df["Age"] > 30]                                # Age > 30
df[(df["Age"] > 25) & (df["City"] == "London")]   # Multiple conditions
df.query("Age > 25 and City == 'London'")         # SQL-like query
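
The filters above combine conditions with `&`. As a small sketch of the other common cases (OR, NOT, membership, and a Python variable inside `query`), assuming the same three-row sample data as in section 27-1; the result names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "London", "Paris"],
})

either = df[(df["Age"] > 30) | (df["City"] == "London")]   # OR: each condition needs parentheses
not_london = df[~(df["City"] == "London")]                  # NOT: ~ inverts a boolean mask
in_europe = df[df["City"].isin(["London", "Paris"])]        # membership test

min_age = 25
older = df.query("Age > @min_age")   # @ refers to a local Python variable
```

Python's `and`, `or`, and `not` do not work on boolean Series, which is why the element-wise operators `&`, `|`, and `~` are used instead.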

***

# 29. Data Cleaning & Preprocessing

***

## 29-1. Handling Missing Data (NaN)

In [None]:
df.isna().sum()                                   # Count missing values per column
df.dropna()                                       # Returns a copy without rows that contain NaN
df.fillna(0)                                      # Returns a copy with NaN replaced by 0
df["Age"] = df["Age"].fillna(df["Age"].mean())    # Fill with mean (reassign; inplace=True on a selected column is unreliable)
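
A point that is easy to miss: `dropna()` and `fillna()` return new objects and leave the original DataFrame untouched unless the result is assigned. The sketch below illustrates this on a small made-up frame with one missing age; the names `dropped` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25.0, np.nan, 35.0]})

missing_per_column = df.isna().sum()   # one missing value in "Age"
dropped = df.dropna()                  # new DataFrame; df itself is unchanged

# Fill the gap with the column mean (the mean of 25 and 35 is 30).
filled = df.assign(Age=df["Age"].fillna(df["Age"].mean()))

print(missing_per_column)
print(len(df), len(dropped))
print(filled)
```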

***

## 29-2. Removing Duplicates

In [None]:
df.drop_duplicates()  # Returns a copy without duplicate rows

***

## 29-3. Renaming Columns

In [None]:
renamed = df.rename(columns={"Name": "Full Name", "Age": "Years"})  # New DataFrame; df keeps "Name" and "Age" for the cells below

***

## 29-4. Changing Data Types

In [None]:
df["Age"] = df["Age"].astype("float")  # Convert to float

***

# 30. Data Manipulation

***

## 30-1. Adding New Columns

In [None]:
df["Age_Next_Year"] = df["Age"] + 1

***

## 30-2. Applying Functions to Columns

In [None]:
df["Name_Length"] = df["Name"].apply(len)  # Apply a function
df["Age_Group"] = df["Age"].apply(lambda x: "Young" if x < 30 else "Old")

***

## 30-3. Sorting Data

In [None]:
df.sort_values("Age", ascending=False)  # Sort by Age (descending)

***

## 30-4. Grouping & Aggregation

In [None]:
df.groupby("City")["Age"].mean()  # Average age per city
df.groupby("City").agg({"Age": ["mean", "max"], "Name": "count"})  # Multiple aggregations
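
The dictionary form of `agg` above produces two-level column names. One alternative is named aggregation, which gives flat, readable column names. This is a sketch on made-up sample rows (the fourth person, "Dana", is hypothetical), and the output column names are arbitrary choices.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25, 30, 35, 40],
    "City": ["London", "London", "Paris", "Paris"],
})

# Each keyword becomes an output column: name=(source column, aggregation).
summary = df.groupby("City").agg(
    mean_age=("Age", "mean"),
    max_age=("Age", "max"),
    people=("Name", "count"),
)
print(summary)
```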

***

# 31. Merging & Joining DataFrames

***

## 31-1. Concatenation

In [None]:
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})
pd.concat([df1, df2], axis=0)  # Stack vertically (rows)
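
Stacking vertically keeps each frame's original row labels, so the result above has the index `0, 1, 0, 1`. The sketch below shows two common variations, `ignore_index=True` and `axis=1`, using the same two frames; the result names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

stacked = pd.concat([df1, df2], axis=0)                        # index repeats: 0, 1, 0, 1
renumbered = pd.concat([df1, df2], axis=0, ignore_index=True)  # fresh index: 0, 1, 2, 3
side_by_side = pd.concat([df1, df2], axis=1)                   # columns: A, B, A, B

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(side_by_side.shape)
```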

***

## 31-2. Joining (Like SQL)

In [None]:
left = pd.DataFrame({"key": ["A", "B"], "value": [1, 2]})
right = pd.DataFrame({"key": ["A", "B"], "value": [3, 4]})
pd.merge(left, right, on="key", how="inner")  # Inner join
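
Because both frames have a `value` column, pandas renames them `value_x` and `value_y` in the result. The following sketch shows how `suffixes`, `how`, and `indicator` change the outcome. The extra keys "C" and "D" are made up so that the join types actually differ.

```python
import pandas as pd

left = pd.DataFrame({"key": ["A", "B", "C"], "value": [1, 2, 3]})
right = pd.DataFrame({"key": ["A", "B", "D"], "value": [4, 5, 6]})

# Inner join: only keys present in both frames (A and B).
inner = pd.merge(left, right, on="key", how="inner", suffixes=("_left", "_right"))

# Left join: every row of `left`; missing matches become NaN.
left_join = pd.merge(left, right, on="key", how="left", suffixes=("_left", "_right"))

# Outer join with an indicator column recording where each row came from.
outer = pd.merge(left, right, on="key", how="outer", indicator=True)

print(inner)
print(left_join)
print(outer)
```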

***

# 32. Time Series Handling

***

## 32-1. Working with Dates

In [None]:
df["Date"] = pd.to_datetime(df["Date"])  # Convert to datetime
df["Year"] = df["Date"].dt.year          # Extract year
df["Month"] = df["Date"].dt.month        # Extract month
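
Real date columns often contain entries that cannot be parsed. One option, sketched here on made-up strings, is `errors="coerce"`, which turns unparseable entries into `NaT` instead of raising an error.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2024-01-15", "2024-02-20", "not a date"]})

# Unparseable strings become NaT (the datetime equivalent of NaN).
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

df["Year"] = df["Date"].dt.year     # NaN where the date is missing
df["Month"] = df["Date"].dt.month

print(df)
```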

***

## 32-2. Resampling (Aggregating Over Time)

In [None]:
df.set_index("Date", inplace=True)
df.resample("ME").mean(numeric_only=True)  # Monthly average of numeric columns ("ME" = month end; use "M" on pandas older than 2.2)
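
Here is a self-contained sketch of resampling on a made-up daily series (`ts`, with a hypothetical `Sales` column) covering January and February 2024. It uses the `"MS"` (month start) and `"W"` (weekly) frequency aliases.

```python
import pandas as pd

# Hypothetical daily values for 2024-01-01 through 2024-02-29 (31 + 29 days).
ts = pd.DataFrame(
    {"Sales": range(60)},
    index=pd.date_range("2024-01-01", periods=60, freq="D"),
)

monthly = ts.resample("MS").mean()     # one row per month, labeled by month start
weekly_total = ts.resample("W").sum()  # one row per week, labeled by the week's Sunday

print(monthly)
print(weekly_total.head())
```

Resampling requires a datetime index, which is why the earlier cell calls `set_index("Date")` first.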

***

# 33. Exporting Data

In [None]:
df.to_csv("output.csv", index=False)  # Save to CSV
df.to_excel("output.xlsx", sheet_name="Data")  # Save to Excel (needs an Excel engine such as openpyxl installed)

***

# 34. Advanced Features

***

## 34-1. Pivot Tables

In [None]:
df.pivot_table(index="City", columns="Age_Group", values="Age", aggfunc="mean")
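
To see what the pivot table produces, here is a small sketch on made-up rows. The `Age_Group` labels follow the "Young"/"Old" convention from section 30-2, and the ages are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["London", "London", "London", "Paris", "Paris"],
    "Age_Group": ["Young", "Young", "Old", "Young", "Old"],
    "Age": [22, 26, 40, 28, 50],
})

# Rows are cities, columns are age groups, cells hold the mean age.
table = df.pivot_table(index="City", columns="Age_Group", values="Age", aggfunc="mean")
print(table)
```

Each cell is the aggregate of all rows that share that city and age group, so the two young Londoners (22 and 26) collapse into a single mean of 24.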

***

## 34-2. Handling Categorical Data

In [None]:
df["City"] = df["City"].astype("category")  # Saves memory when the column has few distinct values

***

## 34-3. String Operations

In [None]:
df["Name"].str.upper()       # Convert to uppercase
df["Name"].str.contains("A") # Check if name contains "A"

***

# 35. Performance Optimization

***

## 35-1. Vectorized Operations (Faster Than Loops)

In [None]:
df["Age_Squared"] = df["Age"] ** 2  # Much faster than looping
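
As a sketch of what "vectorized" means in practice, the snippet below computes the same result with an explicit loop and with a single column expression. It then uses `np.where` as one possible vectorized alternative to the `apply(lambda ...)` pattern from section 30-2.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [25, 30, 35]})

# Loop version: Python handles one element at a time.
looped = [age ** 2 for age in df["Age"]]

# Vectorized version: one expression over the whole column.
df["Age_Squared"] = df["Age"] ** 2

# Vectorized conditional, equivalent to apply(lambda x: "Young" if x < 30 else "Old").
df["Age_Group"] = np.where(df["Age"] < 30, "Young", "Old")

print(df)
```

Both versions give identical values. The speed difference only becomes noticeable on large columns.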

***

## 35-2. Using .loc & .iloc for Indexing

In [None]:
df.loc[0, "Name"]    # Access by label
df.iloc[0, 1]        # Access by position
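
With the default index `0, 1, 2, ...`, labels and positions coincide, which hides the difference between the two accessors. The sketch below uses a made-up index of `10, 20, 30` to make the distinction visible.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=[10, 20, 30],
)

by_label = df.loc[10, "Name"]     # row labeled 10, column "Name"
by_position = df.iloc[0, 1]       # first row, second column

label_slice = df.loc[10:20]       # label slices include both endpoints
position_slice = df.iloc[0:1]     # position slices exclude the end, like Python lists

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```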

***

# Some Exercises

**1.**  Create a DataFrame from three different sources using pandas:
- A Python dictionary of student grades
- A CSV file of weather data
- A list of dictionaries containing product information

***

**2.** Given a messy dataset with:
- Missing values in age column
- Inconsistent capitalization in city names
- Duplicate records
- Invalid date formats

Write a cleaning pipeline using pandas.

***

**3.** Using pandas, analyze the Titanic dataset to:
- Calculate survival rates by passenger class
- Find age distribution differences between survivors/non-survivors
- Identify any correlations between fare price and survival

***

**4.** Using pandas, transform a sales DataFrame to:
- Create a new "profit" column (revenue - cost)
- Group sales by month
- Find the 3 best-selling products
- Calculate rolling 7-day averages

***

**5.** Using pandas, combine three datasets:
- Customer information (name, ID)
- Purchase history (transaction logs)
- Product catalog (item details)

Then create a unified view showing customer purchase patterns.

***

**6.** Using pandas, analyze stock price data to:
- Resample daily prices to weekly averages
- Calculate 30-day moving volatility
- Identify days with abnormal price movements (>2 std deviations)

***

**7.** Using a large e-commerce dataset:
- Create a pivot table of sales by category and region
- Use pd.cut() to bin continuous variables
- Implement a memory optimization strategy

***

**8.** Build an end-to-end data pipeline:
- Import JSON API data
- Clean and validate the data
- Perform exploratory analysis
- Export to both Excel and SQLite
- Create summary statistics report

***

# 🌞 https://github.com/AI-Planet 🌞