# Data Processing and Manipulation with Pandas and NumPy

Welcome to this workshop! In this session, we'll cover:

- Data processing and manipulation using **Pandas**
- Performing equivalent operations using **NumPy**
- Utilizing Pandas' native plotting features for quick data exploration

## Table of Contents

1. [Data Loading and Exploration](#data_loading)
2. [Data Manipulation with Pandas](#data_manipulation_pandas)
3. [Equivalent Operations with NumPy](#equivalent_numpy)
4. [Pandas Native Plotting](#pandas_plotting)
5. [Conclusion](#conclusion)

---

<a id='data_loading'></a>
## 1. Data Loading and Exploration

We'll start by loading a sample dataset and exploring its contents.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 

### Loading the Dataset

- We'll use the **Titanic** dataset for demonstration.
- This dataset contains information about the passengers on the Titanic.

In [None]:
# Load the Titanic dataset from a URL
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Display the first few rows
df.head()

### Data Overview

- **Columns**:
  - `PassengerId`, `Survived`, `Pclass`, `Name`, `Sex`, `Age`, `SibSp`, `Parch`, `Ticket`, `Fare`, `Cabin`, `Embarked`.
- Use `df.info()` and `df.describe()` to get more details.

In [None]:
# Get information about the dataset
df.info()

In [None]:
# Get statistical summary
df.describe()

<a id='data_manipulation_pandas'></a>
## 2. Data Manipulation with Pandas

We'll perform common data manipulation tasks using Pandas.

### Selecting Columns

- Select a single column: `df['Age']`
- Select multiple columns: `df[['Age', 'Fare']]`

In [None]:
# Select 'Age' and 'Fare' columns
df_subset = df[['Age', 'Fare']]
df_subset.head()
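
The difference between the two selection forms is easy to miss: single brackets return a one-dimensional `Series`, while double brackets return a two-dimensional `DataFrame`, even for one column. A minimal sketch on a made-up frame (the name `toy` and its values are illustrative assumptions, not Titanic data):

```python
import pandas as pd

# Toy stand-in for the Titanic frame (values are made up for illustration)
toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Fare": [7.25, 71.28, 7.92]})

age_series = toy["Age"]    # single brackets -> Series (1-D)
age_frame = toy[["Age"]]   # double brackets -> DataFrame (2-D, one column)

print(type(age_series).__name__, age_series.shape)
print(type(age_frame).__name__, age_frame.shape)
```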

### Filtering Rows

- Filter rows where `Age` > 30: `df[df['Age'] > 30]`
- Filter rows where `Sex` is `female`: `df[df['Sex'] == 'female']`

In [None]:
# Filter passengers older than 30
df_over_30 = df[df['Age'] > 30]
df_over_30.head()
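
The two filters above can also be combined into one boolean mask. The sketch below assumes a small made-up frame named `toy`; the point is the syntax: `&` and `|` work element-wise, and each comparison needs its own parentheses.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 54.0],
})

# AND: both conditions must hold for a row to be kept
women_over_30 = toy[(toy["Age"] > 30) & (toy["Sex"] == "female")]

# OR: a row is kept if either condition holds
over_30_or_female = toy[(toy["Age"] > 30) | (toy["Sex"] == "female")]

print(women_over_30)
print(over_30_or_female)
```

Python's `and` / `or` keywords do not work here, because they expect a single truth value rather than a Series of them.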

### GroupBy Operations

- Calculate mean age by sex: `df.groupby('Sex')['Age'].mean()`
- Count survivors by passenger class: `df.groupby('Pclass')['Survived'].sum()` (`Survived` is coded 0/1, so the sum is the number of survivors)

In [None]:
# Mean age by sex
mean_age_by_sex = df.groupby('Sex')['Age'].mean()
mean_age_by_sex

In [None]:
# Count survivors per passenger class ('Survived' is 0/1, so the sum is a count)
df.groupby('Pclass')['Survived'].sum()
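
Because `Survived` is a 0/1 flag, different aggregations of the same column answer different questions: the sum counts survivors, the count gives group size, and the mean gives the survival rate. A hedged sketch using named aggregation on a made-up frame (`toy` and the output column names are assumptions chosen for illustration):

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# One pass over the groups, three aggregations of the same column
summary = toy.groupby("Pclass")["Survived"].agg(
    survivors="sum",        # number of 1s in each class
    passengers="count",     # number of non-missing rows in each class
    survival_rate="mean",   # fraction of 1s in each class
)
print(summary)
```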


### Adding New Columns

- Create a new column `FamilySize` as the sum of `SibSp` and `Parch`.

In [None]:
# Add a new column 'FamilySize'
df['FamilySize'] = df['SibSp'] + df['Parch']
df[['SibSp', 'Parch', 'FamilySize']].head()

### Handling Missing Values

- Check for missing values: `df.isnull().sum()`
- Fill missing `Age` values with the mean age: `df['Age'] = df['Age'].fillna(df['Age'].mean())`
- Drop rows with missing `Embarked`: `df.dropna(subset=['Embarked'], inplace=True)`

In [None]:
# Check for missing values
df.isnull().sum()

In [None]:
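# Fill missing 'Age' values with the mean age.
# Assign the result back: fillna(..., inplace=True) on a selected column
# warns in recent pandas versions and may not modify df.
df['Age'] = df['Age'].fillna(df['Age'].mean())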

# Drop rows with missing 'Embarked'
df.dropna(subset=['Embarked'], inplace=True)

# Verify missing values again
df.isnull().sum()
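
To see what these two steps do without scrolling through the full dataset, here is a minimal sketch on a made-up three-row frame where the result can be checked by eye. It assigns the filled column back to the frame, a form that behaves the same across pandas versions. The frame `toy` and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up rows: one missing age, one missing port of embarkation
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Embarked": ["S", "C", None],
})

# mean() skips NaN, so the fill value is (22 + 26) / 2 = 24
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# dropna returns a new frame without the rows missing 'Embarked'
toy = toy.dropna(subset=["Embarked"])

print(toy)
print(toy.isnull().sum())
```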

<a id='equivalent_numpy'></a>
## 3. Equivalent Operations with NumPy

We'll demonstrate how to perform similar data manipulation tasks using NumPy arrays.

### Converting DataFrame to NumPy Array

- Use `.to_numpy()` (the older `.values` attribute also works): `df.to_numpy()`
- Select specific columns: `df[['Age', 'Fare']].to_numpy()`

In [None]:
# Reload the dataset so the original missing values are present again
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Convert 'Age' and 'Fare' columns to a NumPy array.
# copy=True gives a writable array that is independent of df.
data_array = df[['Age', 'Fare']].to_numpy(copy=True)
print(data_array[:5])
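
Which columns are selected matters for the resulting array's dtype. The sketch below, on a made-up frame named `toy`, shows that numeric columns convert to a float array, while mixing in a text column yields an `object` array on which numeric functions such as `np.mean` are not useful.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Name": ["Braund", "Cumings"],
    "Age": [22.0, 38.0],
    "Fare": [7.25, 71.28],
})

numeric = toy[["Age", "Fare"]].to_numpy()  # all-float columns -> float array
mixed = toy.to_numpy()                     # text + float columns -> object array

print(numeric.dtype, numeric.shape)
print(mixed.dtype, mixed.shape)
```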

### Filtering Data

- Filter rows where `Age` > 30: `data_array[data_array[:, 0] > 30]`

In [None]:
# Filter rows where 'Age' > 30
filtered_array = data_array[data_array[:, 0] > 30]
print(filtered_array[:5])
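
One detail worth knowing about this mask: any comparison with `NaN` evaluates to `False`, so rows with a missing age are silently excluded. A minimal sketch on a made-up array (`arr` and its values are assumptions for illustration), which also shows how to combine two conditions:

```python
import numpy as np

# Columns: Age, Fare (made-up values; the second row has a missing age)
arr = np.array([
    [22.0, 7.25],
    [np.nan, 8.05],
    [38.0, 71.28],
    [54.0, 51.86],
])

mask = arr[:, 0] > 30   # the NaN row compares as False
older = arr[mask]

# Element-wise AND of two masks, each comparison in its own parentheses
older_and_expensive = arr[(arr[:, 0] > 30) & (arr[:, 1] > 60)]

print(mask)
print(older)
print(older_and_expensive)
```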

### Calculations

- Calculate mean age: `np.mean(data_array[:, 0])` (returns `nan` while `Age` still contains missing values; `np.nanmean` skips them)
- Sum of fares: `np.sum(data_array[:, 1])`

In [None]:
# Calculate mean age and sum of fares
# (mean_age prints as nan here because 'Age' still contains NaNs)
mean_age = np.mean(data_array[:, 0])
total_fare = np.sum(data_array[:, 1])
print(f"Mean Age: {mean_age}")
print(f"Total Fare: {total_fare}")
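
Unlike Pandas, whose `mean()` skips missing values by default, NumPy's plain reductions propagate `NaN`. The sketch below contrasts `np.mean` with the NaN-aware `np.nanmean` and `np.nansum` on a made-up array (`ages` is an illustrative assumption):

```python
import numpy as np

# Made-up ages with one missing value
ages = np.array([22.0, np.nan, 38.0, 26.0])

plain_mean = np.mean(ages)    # a single NaN makes the whole result NaN
safe_mean = np.nanmean(ages)  # ignores NaN: (22 + 38 + 26) / 3
safe_sum = np.nansum(ages)    # treats NaN as zero: 86.0

print(plain_mean, safe_mean, safe_sum)
```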

### Handling Missing Values in NumPy

- Use `np.isnan()` to identify `NaN` values.
- Replace `NaN` in the `Age` column (column 0) with that column's mean: `data_array[np.isnan(data_array[:, 0]), 0] = np.nanmean(data_array[:, 0])`

In [None]:
# Identify NaN values
nan_indices = np.isnan(data_array)
print(f"Number of NaNs: {np.sum(nan_indices)}")

# Replace NaNs with mean age for 'Age' column
mean_age = np.nanmean(data_array[:, 0])
data_array[np.isnan(data_array[:, 0]), 0] = mean_age

# Verify NaN replacement
print(f"Number of NaNs after replacement: {np.sum(np.isnan(data_array))}")

In [None]:
# Recompute the statistics now that the NaNs in 'Age' have been replaced
mean_age = np.mean(data_array[:, 0])
total_fare = np.sum(data_array[:, 1])
print(f"Mean Age: {mean_age}")
print(f"Total Fare: {total_fare}")
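
NumPy has no direct counterpart to `groupby`, but the mean-age-by-sex calculation from the Pandas section can be approximated with `np.unique` and boolean masks. This is one possible sketch, not the only approach; the arrays `sex` and `age` are made-up assumptions standing in for the dataset's columns.

```python
import numpy as np

# Made-up parallel arrays standing in for the 'Sex' and 'Age' columns
sex = np.array(["male", "female", "female", "male", "female"])
age = np.array([22.0, 38.0, 26.0, 35.0, np.nan])

# np.unique returns the sorted distinct labels
groups = np.unique(sex)

# For each label, mask the ages of that group and average them, skipping NaN
mean_age_by_sex = {str(g): float(np.nanmean(age[sex == g])) for g in groups}

print(mean_age_by_sex)
```

`np.nanmean` is used so that the result matches Pandas, which skips missing values when averaging each group.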

<a id='pandas_plotting'></a>
## 4. Pandas Native Plotting

Pandas provides built-in plotting functions for quick data visualization.

### Histogram

- Plot a histogram of the `Age` column: `df['Age'].plot.hist()`

In [None]:
# Plot histogram of 'Age'
df['Age'].plot.hist(bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of Passenger Ages')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
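
Pandas plotting methods return a Matplotlib `Axes` object, so labels can also be set on that object directly instead of through `plt`. A minimal sketch on made-up ages (the frame `toy` is an assumption, and the `Agg` backend is selected only so the snippet runs outside a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Made-up ages for illustration only
toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0]})

# The plot call returns the Axes it drew on
ax = toy["Age"].plot.hist(bins=3, edgecolor="black")
ax.set_title("Distribution of Passenger Ages")
ax.set_xlabel("Age")
ax.set_ylabel("Frequency")

fig = ax.get_figure()  # the parent Figure, e.g. for fig.savefig(...)
```

Working with the returned `Axes` keeps the labels attached to a specific plot, which helps once a notebook contains several figures.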

### Bar Plot

- Plot the number of passengers by `Pclass`, with bars in class order: `df['Pclass'].value_counts().sort_index().plot.bar()`

In [None]:
# Bar plot of passenger class counts
df['Pclass'].value_counts().sort_index().plot.bar(color='orange', edgecolor='black')
plt.title('Number of Passengers by Class')
plt.xlabel('Passenger Class')
plt.ylabel('Number of Passengers')
plt.show()

### Scatter Plot

- Create a scatter plot of `Age` vs. `Fare`: `df.plot.scatter(x='Age', y='Fare')`

In [None]:
# Scatter plot of 'Age' vs. 'Fare'
df.plot.scatter(x='Age', y='Fare', alpha=0.5, color='green')
plt.title('Age vs. Fare')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.show()

### Box Plot

- Create a box plot of `Age` grouped by `Pclass`: `df.boxplot(column='Age', by='Pclass')`

In [None]:
# Box plot of 'Age' by 'Pclass'
df.boxplot(column='Age', by='Pclass', grid=False)
plt.title('Age Distribution by Passenger Class')
plt.suptitle('')  # Remove the automatic 'Boxplot grouped by ...' title
plt.xlabel('Passenger Class')
plt.ylabel('Age')
plt.show()

<a id='conclusion'></a>
## 5. Conclusion

In this workshop, we've:

- Explored data loading and initial exploration with Pandas.
- Performed data manipulation tasks using Pandas DataFrames.
- Demonstrated how to perform equivalent operations using NumPy arrays.
- Utilized Pandas' native plotting functions for quick data visualization.

**Next Steps:**

- Practice with your own datasets to reinforce these concepts.
- Explore more advanced data manipulation techniques in Pandas and NumPy.
- Learn about data visualization libraries like Seaborn and Plotly for enhanced plots.

---