# Pandas Fundamentals

## 1. What is Pandas and why do we use it?
Pandas is a powerful Python library for data manipulation and analysis. It provides data structures like Series and DataFrames, making it easy to work with structured data. We use Pandas for tasks such as data cleaning, transformation, and exploration, which are essential steps in the data analysis process.


## 2. Creating DataFrames
DataFrames are two-dimensional, size-mutable, and potentially heterogeneous tabular data structures with labeled axes (rows and columns). They are similar to spreadsheets or SQL tables and are one of the most commonly used data structures in Pandas. Here are some common ways to create DataFrames in Pandas:
1. **From a Dictionary**:
   You can create a DataFrame from a dictionary where the keys are column names and the values are lists of column data.
   ```python
   import pandas as pd

   data = {
       'Name': ['Alice', 'Bob', 'Charlie'],
       'Age': [25, 30, 35],
       'City': ['New York', 'Los Angeles', 'Chicago']
   }
   df = pd.DataFrame(data)
   print(df)
   ```
2. **From a List of Dictionaries**:
   You can also create a DataFrame from a list of dictionaries, where each dictionary represents a row of data.
   ```python
   data = [
       {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
       {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
       {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
   ]
   df = pd.DataFrame(data)
   print(df)
   ```
3. **From a CSV File**:
   You can read data from a CSV file directly into a DataFrame using the `read_csv` function.
   ```python
   df = pd.read_csv('data.csv')
   print(df)
   ```
4. **From an Excel File**:
   You can read data from an Excel file using the `read_excel` function.
   ```python
   df = pd.read_excel('data.xlsx')
   print(df)
   ```
5. **From a NumPy Array**:
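   You can create a DataFrame from a NumPy array by specifying the column names. A NumPy array holds a single dtype, so mixing numbers and strings stores every value as a string; convert numeric columns back afterward.
   ```python
   import numpy as np

   # Mixed numbers and strings: NumPy coerces every element to a string
   data = np.array([
       [25, 'New York'],
       [30, 'Los Angeles'],
       [35, 'Chicago']
   ])
   df = pd.DataFrame(data, columns=['Age', 'City'])
   # Restore a numeric dtype so comparisons such as df['Age'] > 30 work
   df['Age'] = df['Age'].astype(int)
   print(df)
   ```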
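
The file-based routes above assume that `data.csv` or `data.xlsx` exists on disk. As a minimal sketch, the same `read_csv` call can be tried on in-memory text instead. The CSV content below and the choice of `usecols` and `dtype` values are illustrative assumptions, not part of the original examples.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'data.csv'
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
"""

# read_csv accepts a file path or any file-like object
df_csv = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Name", "Age"],  # load only the columns you need
    dtype={"Age": "int64"},   # state the expected type explicitly
)
print(df_csv)
print(df_csv.dtypes)
```

Restricting columns and declaring types at load time tends to keep memory use down and surfaces malformed values early, although the right options depend on the file.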

## 3. Data Manipulation with Pandas
Pandas provides a wide range of functions for data manipulation, including filtering, sorting, grouping, and aggregating data. Here are some common data manipulation tasks you can perform with Pandas:
1. **Filtering Data**:
   You can filter rows in a DataFrame based on certain conditions.
   ```python
   # Filter rows where Age is greater than 30
   filtered_df = df[df['Age'] > 30]
   print(filtered_df)
   ```
2. **Sorting Data**:
   You can sort a DataFrame by one or more columns.
   ```python
   # Sort by Age in descending order
   sorted_df = df.sort_values(by='Age', ascending=False)
   print(sorted_df)
   ```
3. **Grouping Data**:
   You can group data by one or more columns and perform aggregate functions on the groups.
   ```python
   # Group by City and calculate the average Age
   grouped_df = df.groupby('City')['Age'].mean().reset_index()
   print(grouped_df)
   ```
4. **Adding/Removing Columns**:
   You can add new columns or remove existing ones from a DataFrame.
   ```python
   # Add a new column 'Salary'
   df['Salary'] = [50000, 60000, 70000]
   print(df)

   # Remove the 'City' column
   df = df.drop(columns=['City'])
   print(df)
   ```
5. **Handling Missing Data**:
   You can handle missing data by either filling it with a specific value or dropping rows/columns with missing values.
   ```python
   # Fill missing values with the mean of the column
   df['Age'] = df['Age'].fillna(df['Age'].mean())
   print(df)

   # Drop rows with any missing values
   df = df.dropna()
   print(df)
   ```
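
The individual steps above can also be combined. The sketch below uses a small hypothetical dataset (the names, ages, and cities are invented for illustration) that contains a missing value and repeated cities, so that filling, filtering on two conditions, and grouping each have something to act on.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with repeated cities and one missing age
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana", "Eli"],
    "Age": [25, 30, 35, np.nan, 40],
    "City": ["New York", "Chicago", "Chicago", "New York", "Chicago"],
})

# Fill the missing age with the column mean: (25 + 30 + 35 + 40) / 4 = 32.5
people["Age"] = people["Age"].fillna(people["Age"].mean())

# Combine conditions with & (and) or | (or); each condition needs parentheses
chicago_over_30 = people[(people["Age"] > 30) & (people["City"] == "Chicago")]

# Several aggregations per group, then sort the summary by mean age
summary = (
    people.groupby("City")["Age"]
    .agg(["mean", "count"])
    .sort_values(by="mean", ascending=False)
    .reset_index()
)
print(chicago_over_30)
print(summary)
```

Each step returns a new object, which is why the calls can be chained without modifying `people` along the way.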


## 4. Creating Plots and Visualizations
Pandas integrates well with Matplotlib, allowing you to create various types of plots and visualizations directly from DataFrames through the `.plot` accessor. The guide below matches common kinds of data to suitable plot types.
### Choosing the Right Plot Type
- **Categorical Data** (e.g., brand, gender, country):
  - **Bar Plot** → Compare frequencies or averages across categories.
  - **Pie Chart** → Show proportions of categories (best with a small number of groups).

- **Numerical Data (1 variable)**:
  - **Histogram** → Show the distribution of values.
  - **Box Plot** → Detect spread, median, and outliers.

- **Numerical vs Numerical (2 variables)**:
  - **Scatter Plot** → Show relationships or correlations.
  - **Line Plot** → Use when data has an order (e.g., time series).

- **Categorical vs Numerical**:
  - **Bar Plot** → Compare mean/median values across categories.
  - **Box Plot / Violin Plot** → Show distribution of numerical data for each category.

- **Time Series Data**:
  - **Line Plot** → Best for trends over time.
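
Several of these plot types can be drawn with the `kind` argument of `DataFrame.plot`. The following is a minimal sketch on hypothetical sales data; the column names and values are assumptions chosen only to give each plot something to show.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data
sales = pd.DataFrame({
    "month": [1, 2, 3, 4, 5, 6],
    "region": ["North", "South", "North", "South", "North", "South"],
    "units": [10, 14, 12, 18, 15, 21],
    "revenue": [100, 150, 115, 190, 160, 230],
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Categorical vs numerical: bar plot of mean revenue per region
sales.groupby("region")["revenue"].mean().plot(kind="bar", ax=axes[0, 0], title="Bar")

# One numerical variable: histogram of its distribution
sales["units"].plot(kind="hist", bins=5, ax=axes[0, 1], title="Histogram")

# Numerical vs numerical: scatter plot
sales.plot(kind="scatter", x="units", y="revenue", ax=axes[1, 0], title="Scatter")

# Ordered data: line plot of revenue over months
sales.plot(kind="line", x="month", y="revenue", ax=axes[1, 1], title="Line")

fig.tight_layout()
# In a script, call plt.show() here to display the figure
```

Passing `ax=` places each plot on a chosen subplot, which makes it easy to compare plot types side by side.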



![mlconcepts_image6](/S1-Intro_to_ML/images/plots.png)

## 5. Examples of Pandas in Machine Learning
### Data Cleaning and Preparation
Pandas is often used in the initial stages of a machine learning project for data cleaning and preparation. This includes handling missing values, removing duplicates, and transforming categorical variables into numerical formats. For example, you can use Pandas to fill missing values with the mean or median of a column, drop rows with missing values, or convert categorical variables into one-hot encoded vectors.
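
These cleaning steps can be tried offline on a toy frame before working with a real dataset. The sketch below is illustrative only: the `price` and `zone` columns and their values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one duplicated row, one missing price, one categorical column
raw = pd.DataFrame({
    "price": [100.0, 200.0, 200.0, np.nan, 400.0],
    "zone": ["coast", "inland", "inland", "coast", "island"],
})

# 1. Remove exact duplicate rows (the second 200.0 / inland row)
clean = raw.drop_duplicates().reset_index(drop=True)

# 2. Fill the missing price with the median of the remaining values
clean["price"] = clean["price"].fillna(clean["price"].median())

# 3. One-hot encode the categorical column; each category becomes its own column
encoded = pd.get_dummies(clean, columns=["zone"])
print(encoded)
```

The median is used here instead of the mean because it is less sensitive to extreme values; which one suits a given column is a judgment call.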

```python
import glob
import os

import kagglehub
import numpy as np
import pandas as pd

# Data loading: download the dataset and locate its CSV files
path = kagglehub.dataset_download("camnugent/california-housing-prices")
csv_files = glob.glob(os.path.join(path, "*.csv"))

# Read each CSV and tag its rows with the source file name (stored in a 'brand' column);
# the os (operating system) module supplies the path helpers
dfs = [
    pd.read_csv(file).assign(brand=os.path.splitext(os.path.basename(file))[0])
    for file in csv_files
]
df_combined = pd.concat(dfs, ignore_index=True)

print(df_combined.shape)
print(df_combined.head())
```

```python
# Data Handling with Pandas
# 1. Data Inspection
df_combined.info()  # info() prints its summary itself and returns None
print(df_combined.describe())
print(df_combined.isnull().sum())

# 2. Data Selection
# Selecting specific columns
selected_columns = df_combined[['longitude', 'latitude', 'median_house_value']]
print(selected_columns.head())
# Filtering rows based on conditions
filtered_rows = df_combined[df_combined['median_house_value'] > 500000]
print(filtered_rows.head())

# 3. Data Cleaning
# Handling missing values by filling with the mean
# (assign the result back: fillna(inplace=True) on a selected column
# may not update df_combined in recent pandas versions)
df_combined['total_bedrooms'] = df_combined['total_bedrooms'].fillna(df_combined['total_bedrooms'].mean())
# Removing duplicates
df_combined = df_combined.drop_duplicates()
```

```python
# 4. Feature Engineering
# Creating a new feature: rooms_per_household
df_combined['rooms_per_household'] = df_combined['total_rooms'] / df_combined['households']
print(df_combined[['rooms_per_household']].head())
# Creating a new feature: bedrooms_per_room
df_combined['bedrooms_per_room'] = df_combined['total_bedrooms'] / df_combined['total_rooms']
print(df_combined[['bedrooms_per_room']].head())
# Creating a new feature: population_per_household
df_combined['population_per_household'] = df_combined['population'] / df_combined['households']
print(df_combined[['population_per_household']].head())

# 5. Data Aggregation
# Grouping by 'ocean_proximity' and calculating the mean of 'median_house_value'
# (done before encoding, because get_dummies replaces the 'ocean_proximity' column)
mean_values = df_combined.groupby('ocean_proximity')['median_house_value'].mean()
print(mean_values)

# 6. Data Transformation
# Min-max normalizing the 'median_house_value' column to the range [0, 1]
value = df_combined['median_house_value']
df_combined['median_house_value_normalized'] = (value - value.min()) / (value.max() - value.min())
print(df_combined[['median_house_value_normalized']].head())
# Encoding categorical variables using one-hot encoding
df_combined = pd.get_dummies(df_combined, columns=['ocean_proximity'], drop_first=True)
print(df_combined.head())
```

### Integration with NumPy
Pandas is built on top of NumPy, and it seamlessly integrates with NumPy arrays. You can easily convert Pandas DataFrames to NumPy arrays and vice versa. This is particularly useful when you need to perform numerical computations or use machine learning libraries that require NumPy arrays as input.
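
The conversion works in both directions. As a minimal sketch on hypothetical numeric data (the `height` and `weight` columns are invented for illustration), a DataFrame can be turned into an array, processed with NumPy, and wrapped back into a DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data
frame = pd.DataFrame({"height": [1.5, 1.6, 1.7], "weight": [50.0, 60.0, 70.0]})

# DataFrame -> NumPy: all-numeric columns give a float array of shape (rows, columns)
array = frame.to_numpy()

# Do the numerical work in NumPy, here centering each column on its mean
centered = array - array.mean(axis=0)

# NumPy -> DataFrame: reattach the column labels and the original index
centered_frame = pd.DataFrame(centered, columns=frame.columns, index=frame.index)
print(centered_frame)
```

The array itself carries no labels, so the column names and index have to be passed back in explicitly when rebuilding the DataFrame.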


```python
import matplotlib.pyplot as plt

# Seamless data type conversion
# Converting DataFrame to NumPy array
# (mixed column types produce an object-dtype array)
numpy_array = df_combined.to_numpy()
print(numpy_array[:5])

# Visualizing the distribution of 'median_house_value'
plt.hist(df_combined['median_house_value'], bins=50, edgecolor='k')
plt.xlabel('Median House Value')
plt.ylabel('Frequency')
plt.title('Distribution of Median House Value')
plt.show()
```

```python
# Using NumPy functions on a Pandas DataFrame
# Calculating the mean and standard deviation of 'median_house_value' using NumPy
# Note: np.std uses ddof=0 by default, unlike Series.std(), which uses ddof=1
mean_value = np.mean(df_combined['median_house_value'])
std_value = np.std(df_combined['median_house_value'])
print(f"Mean: {mean_value}, Standard Deviation: {std_value}")
```
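
One detail is worth checking when mixing the two libraries: NumPy and Pandas use different defaults for the standard deviation. The sketch below uses a small hypothetical sample to show the difference and how to reconcile it.

```python
import numpy as np
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical sample

# NumPy divides by N by default (ddof=0, population standard deviation)
population_std = np.std(values.to_numpy())

# Pandas divides by N - 1 by default (ddof=1, sample standard deviation)
sample_std = values.std()

print(population_std, sample_std)

# Passing ddof explicitly makes the two agree
print(np.isclose(values.std(ddof=0), population_std))
```

On large datasets the two results are nearly identical, but stating `ddof` explicitly avoids surprises when comparing numbers across tools.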


### Key Benefits of This Integration
1. **Ease of Use**: Pandas provides a user-friendly interface for data manipulation, while NumPy offers powerful numerical operations. Together, they make it easier to work with data and perform complex calculations.
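2. **Performance**: Vectorized operations in both libraries run in compiled code and are typically far faster than Python loops. Working on the underlying array (via `to_numpy()`) can also avoid some of Pandas' index-handling overhead, although the gain depends on the workload.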
3. **Flexibility**: You can leverage the strengths of both libraries, using Pandas for data manipulation and NumPy for numerical computations, allowing for more flexible and efficient workflows.

For more information, please refer to the [Pandas Documentation](https://pandas.pydata.org/docs/).