## Intro to Pandas

Learning Objectives:

  Set up the environment with the appropriate packages and libraries.

  Obtain a basic understanding of Pandas data structures: Series and DataFrames.

  Read from CSV files.

  Access and manipulate data.

  Create and manipulate visualizations for data using Pandas, Matplotlib, and Seaborn.

  Gain an overview of Supervised and Unsupervised Learning Models.
  
  Gain an understanding of Gaussian Models and Model Selection.

What is Pandas:

Pandas is an open-source Python library created for data analysis and visualization. It allows the user to load, view, manipulate, and filter data to suit their needs. It is built on top of NumPy. It has many uses, such as data cleaning, data analysis, data transformation, and preparing data for machine learning.

## Environment Setup (Jupyter Notebook VS Extension)

Package Installation:

In [None]:
# In the terminal enter the following
pip install pandas # Installs Pandas and NumPy
pip install matplotlib # Installs Matplotlib
pip install seaborn # Installs Seaborn
pip install scikit-learn # Installs scikit-learn (used in the machine learning sections)

Import Libraries:

In [None]:
import numpy as np # Imports numpy with alias "np"
import pandas as pd # Imports pandas with alias "pd"
import matplotlib.pyplot as plt # Imports Matplotlib with alias "plt"
import seaborn as sns # Imports Seaborn with alias "sns"

## Core Data Structures

The core data structures in Pandas are Series and DataFrames.

A Series is a one-dimensional labeled array: each value is paired with an index label (by default, integers starting at 0).

A DataFrame is a two-dimensional labeled table, often built from a dictionary of columns. Columns are labeled by name (the dictionary keys) and rows are labeled by index.

You can create your own of each.

Series Object:

In [None]:
pd.Series(["Dogs", "Cats", "Lions", "Tigers", "Bears"])

DataFrame:

In [None]:
species = pd.Series(["Dogs", "Cats", "Lions", "Tigers", "Bears"])
area_pop = pd.Series([1234, 4598, 7, 3, 99])

pd.DataFrame({ "Species": species, "Population": area_pop })
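To make the idea of labeled axes more concrete, the sketch below builds a Series with custom index labels and inspects the row and column labels of a small DataFrame. The index labels `a`, `b`, `c` and the variable names are made up for illustration.

```python
import pandas as pd

# A Series with explicit index labels instead of the default 0, 1, 2, ...
counts = pd.Series([1234, 4598, 7], index=["a", "b", "c"], name="Population")

# A DataFrame built from a dictionary: the keys become column names
animals = pd.DataFrame({
    "Species": ["Dogs", "Cats", "Lions"],
    "Population": [1234, 4598, 7],
})

print(counts["b"])            # Look up a value by its index label
print(animals.shape)          # (number of rows, number of columns)
print(list(animals.columns))  # Column labels
print(list(animals.index))    # Row labels (default integer index)
```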

## Using Data from Files

DataFrames are usually created from existing data, such as CSV files.

Read CSV:

In [None]:
imdb_dataframe = pd.read_csv("./imdb_top_250.csv") # Import data from CSV to DataFrame
imdb_dataframe.describe() # Output Statistics

Popular methods:

In [None]:
imdb_dataframe.info() # Output summary of DataFrame
imdb_dataframe.columns # Output column names
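If the CSV file is not at hand, the same calls can be tried on in-memory text. The sketch below is a stand-in under that assumption: the rows are invented placeholders, and only the column names mirror those used in this notebook.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Rank,Title,Year,Rating
1,Movie A,1994,9.3
2,Movie B,1972,9.2
3,Movie C,2008,9.0
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.head(2))           # First two rows
print(sample_df.dtypes)            # Data type inferred for each column
print(sample_df["Rating"].mean())  # Average of one numeric column
```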

## Data Manipulation

Data can be accessed by column name, by index, or by a range of indices.

Output can be filtered using ranges or boolean conditions.

Accessing:

In [None]:
print(imdb_dataframe["Title"], "\n") # Output title column
print(imdb_dataframe.loc[0], "\n") # Select the row with index label 0 (the first row)
print(imdb_dataframe.iloc[0:2], "\n") # Select first two rows by position
print(imdb_dataframe["Title"][124], "\n") # Select 125th title
print(imdb_dataframe["Title"][20:30]) # Select titles at positions 20 through 29
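With the default integer index, `loc` and `iloc` often return the same rows, which can hide the difference between them. The sketch below uses a made-up DataFrame with a non-default index (10, 20, 30) to show that `loc` selects by label while `iloc` selects by position.

```python
import pandas as pd

# Hypothetical data with a non-default index to make the difference visible
df = pd.DataFrame(
    {"Title": ["Movie A", "Movie B", "Movie C"], "Rating": [9.3, 9.2, 9.0]},
    index=[10, 20, 30],
)

by_label = df.loc[10, "Title"]   # Row with index label 10
by_position = df.iloc[0, 0]      # Row at position 0, column at position 0
label_slice = df.loc[10:30]      # Label slices include both endpoints
position_slice = df.iloc[0:2]    # Position slices exclude the stop position

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```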



Manipulating:

In [None]:
imdb_dataframe[imdb_dataframe["Rank"] <= 25] # Output top 25 movies
imdb_dataframe[imdb_dataframe["Title"].str[0].between("M", "U")] # Output movies whose titles start with M through U
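Filters can also be combined. The following sketch, using invented rows, shows how two conditions might be joined with `&` (and) or `|` (or); each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical rows for illustration
movies = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Year": [1994, 1972, 2008, 2019],
    "Rating": [9.3, 9.2, 9.0, 8.4],
})

# Each comparison produces a boolean Series; & means "and", | means "or"
recent_and_high = movies[(movies["Year"] >= 2000) & (movies["Rating"] >= 9.0)]
old_or_low = movies[(movies["Year"] < 1980) | (movies["Rating"] < 8.5)]

print(recent_and_high["Title"].tolist())
print(old_or_low["Title"].tolist())
```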

Add and Remove:

In [None]:
pd_companies = ["Paramount"] * 100 + ["Disney"] * 100 + ["Sony"] * 50 # Create 250 values, one per row
imdb_dataframe["Production Company"] = pd_companies # Add column
imdb_dataframe.drop("Production Company", axis = 1, inplace = True) # Remove column

Handling missing data:

In [None]:
imdb_dataframe.fillna(0, inplace = True) # Put a 0 in areas with missing data
imdb_dataframe.dropna(inplace = True) # Alternatively, remove rows with missing data (no effect if fillna ran first)
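The two approaches give different results, so it can help to compare them on a copy before modifying the real data. This is a small sketch with a made-up DataFrame that contains one missing rating.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing rating
scores = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Rating": [9.3, np.nan, 9.0],
})

print(scores.isna().sum())             # Count missing values per column

filled = scores.fillna({"Rating": 0})  # New DataFrame with missing ratings set to 0
dropped = scores.dropna()              # New DataFrame without the incomplete row

print(len(filled), len(dropped))       # The original scores DataFrame is unchanged
```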

## Data Visualization

Data can be visualized using Pandas Plots, Matplotlib, and Seaborn.

There are various chart types that can be used based on the data type and desired message.

Some of the most used charts are: Bar Charts, Line Charts, Scatter Plots, and Pie Charts.

Pandas Plotting:

In [None]:
import pandas as pd

new_df = imdb_dataframe[["Title", "Rating"]].copy() # Create dataframe with columns to plot
new_df.set_index("Title", inplace = True) # Set index based on title
new_df["Rating"].plot(kind = "bar", title = "IMDb Ratings", xlabel = "Title", ylabel = "Rating") # Plot the prepared data

Matplotlib:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
# Scatter plot comparing correlation between year and rank of films
plt.figure(figsize = (10, 6)) # Set size of chart
plt.scatter(imdb_dataframe["Year"], imdb_dataframe["Rank"], color = "red", edgecolor = "black") # Plot data
plt.title("Rank vs Year", fontsize = 24) # Chart name
plt.xlabel("Year") # X axis label
plt.ylabel("Rank") # Y axis label
plt.show() # Output chart

Seaborn:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Create the line plot
plt.figure(figsize = (10, 6)) # Set size of chart
sns.lineplot(data = imdb_dataframe, x = "Year", y = "Rating", marker = "o", color = "orange") # Plot data
plt.title("Ratings Over Time") # Chart name
plt.xlabel("Year") # X axis label
plt.ylabel("Rating") # Y axis label
plt.show() # Output chart
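Pie charts are listed among the common chart types, so here is a minimal sketch of one using the Pandas plotting interface. The genre counts are invented, and the `matplotlib.use("Agg")` line is only an assumption for running outside a notebook; inside Jupyter it can be dropped and `plt.show()` added at the end.

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend (assumption: running outside a notebook)
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts of films per genre
genre_counts = pd.Series({"Drama": 5, "Action": 3, "Comedy": 2})

# autopct prints the percentage on each slice
ax = genre_counts.plot(kind="pie", autopct="%1.0f%%", title="Films by Genre")
ax.set_ylabel("")  # Hide the default y-axis label

plt.close("all")   # Release the figure when it is no longer needed
```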

## Supervised Learning

Supervised learning trains machine learning models on labeled datasets, where each example comes with the answer the model should learn to predict. Pandas is used to clean and manipulate the data before it is fed to the model. Linear regression and decision trees are two commonly used supervised models.

Linear Regression:

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression

costs = {
    "Monthly Salary": [3000, 4000, 5000, 6000, 7000],
    "Money Spent": [1500, 2000, 2500, 3000, 3500]
} # Data
cost_df = pd.DataFrame(costs) # Create DataFrame
x = cost_df[["Monthly Salary"]]  # Predictor
y = cost_df["Money Spent"]       # Target
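model = LinearRegression() # Create model
model.fit(x, y) # Train model
new_salary = pd.DataFrame({"Monthly Salary": [5500]}) # New input with the same column name as the training data
prediction = model.predict(new_salary) # Predict costs on $5500 salary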
print(f"Predicted money spent for $5500 salary: ${prediction[0]:.2f}")
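After fitting, the learned line can be inspected directly. The sketch below repeats the salary data from the example above and reads the slope, intercept, and R² score from the model; the variable names `slope`, `intercept`, and `r_squared` are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

cost_df = pd.DataFrame({
    "Monthly Salary": [3000, 4000, 5000, 6000, 7000],
    "Money Spent": [1500, 2000, 2500, 3000, 3500],
})
x = cost_df[["Monthly Salary"]]  # Predictor (2-D)
y = cost_df["Money Spent"]       # Target (1-D)

model = LinearRegression()
model.fit(x, y)

slope = model.coef_[0]         # Change in spending per extra dollar of salary
intercept = model.intercept_   # Predicted spending at a salary of 0
r_squared = model.score(x, y)  # 1.0 means the line fits the training data exactly

print(f"slope={slope:.3f}, R^2={r_squared:.3f}")
```

Because this toy data lies exactly on a line (spending is half of salary), the fit is perfect; real data would normally give a lower score.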

Decision Tree:

In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

movies = {
    "Budget": [50, 80, 150, 200],
    "Rating": [6.5, 7.0, 8.0, 8.5],
    "Genre": ["Comedy", "Comedy", "Action", "Action"]
}
df = pd.DataFrame(movies) # Create DataFrame

x = df[["Budget", "Rating"]] # Features (predictors)
y = df["Genre"]              # Target (what to predict)

tree = DecisionTreeClassifier() # Create Decision Tree model
tree.fit(x, y) # Train the model on the data

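new_movie = pd.DataFrame({"Budget": [100], "Rating": [7.5]}) # New movie with the same columns as the training data
prediction = tree.predict(new_movie) # Predict genre for new movie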
print(f"Predicted Genre: {prediction[0]}") # Output the prediction

## Unsupervised Learning

Unsupervised learning trains machine learning models on unlabeled datasets, so the model has to find structure in the data on its own. Pandas is used to clean and manipulate the data before it is fed to the model. Clustering and dimensionality reduction are two of the main techniques used in unsupervised learning.

Clustering (K-Means):

In [None]:
import pandas as pd
from sklearn.cluster import KMeans

data = {
    "Monthly Salary": [3000, 4000, 5000, 7000, 8000],
    "Money Spent": [1500, 2000, 2500, 3300, 3500]
}
df = pd.DataFrame(data) # Create DataFrame

kmeans = KMeans(n_clusters=2, random_state=0) # Create clustering model
df["Cluster"] = kmeans.fit_predict(df) # Fit model and label each row with a cluster
print(df) # Output clustered data
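Besides the label for each row, a fitted K-Means model also stores the center of each cluster. The sketch below reuses the data above and prints those centers; `n_init=10` is set explicitly as an assumption to keep behavior consistent across scikit-learn versions.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "Monthly Salary": [3000, 4000, 5000, 7000, 8000],
    "Money Spent": [1500, 2000, 2500, 3300, 3500],
})

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(df)  # Cluster label for each row

# Each center is the mean of the rows assigned to that cluster
centers = pd.DataFrame(kmeans.cluster_centers_, columns=df.columns)
print(labels)
print(centers)
```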

Dimensionality Reduction:

In [None]:
import pandas as pd
from sklearn.decomposition import PCA

data = {
    "Salary": [3000, 4000, 5000, 6000],
    "Spending": [1500, 2000, 2500, 3000],
    "Savings": [500, 800, 1000, 1300]
}
df = pd.DataFrame(data) # Create DataFrame

pca = PCA(n_components=2) # Create PCA model to reduce dimensions to 2
reduced = pca.fit_transform(df) # Apply PCA to the data

reduced_df = pd.DataFrame(reduced, columns=["PC1", "PC2"]) # Convert result to DataFrame
print(reduced_df) # Output reduced data
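One way to judge how much information survives the reduction is the explained variance ratio of each component. This is a short sketch on the same data as above; a first ratio close to 1 suggests that a single component already captures most of the variation.

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "Salary": [3000, 4000, 5000, 6000],
    "Spending": [1500, 2000, 2500, 3000],
    "Savings": [500, 800, 1000, 1300],
})

pca = PCA(n_components=2)
pca.fit(df)

# Fraction of the total variance captured by each principal component
ratios = pca.explained_variance_ratio_
for i, ratio in enumerate(ratios, start=1):
    print(f"PC{i}: {ratio:.4f}")
```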

## Gaussian Mixture Models

Gaussian Mixture Models (GMMs) are an unsupervised learning technique that groups data based on probability distributions. Pandas is used for organizing and preparing the data before applying the model. A GMM identifies hidden clusters by assuming the data is made up of overlapping Gaussian distributions.

In [None]:
import pandas as pd
from sklearn.mixture import GaussianMixture

data = {
    "Monthly Salary": [3000, 4000, 5000, 7000, 8000],
    "Money Spent": [1500, 2000, 2500, 3300, 3500]
}
df = pd.DataFrame(data) # Create DataFrame

gmm = GaussianMixture(n_components=2, random_state=0) # Create GMM with 2 groups
gmm.fit(df) # Fit the model to the data

df["Group"] = gmm.predict(df) # Assign each row to a Gaussian group
print(df) # Output the grouped data
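Unlike K-Means, a GMM can report how strongly each point belongs to each group. The sketch below illustrates this with `predict_proba`. It uses synthetic data generated with NumPy (an assumption made so that each group has enough points for a stable fit), and the three new points are invented.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

# Synthetic data: two groups of 50 points around different salary levels
rng = np.random.default_rng(0)
low = rng.normal(loc=[3000, 1500], scale=[300, 150], size=(50, 2))
high = rng.normal(loc=[7000, 3500], scale=[300, 150], size=(50, 2))
columns = ["Monthly Salary", "Money Spent"]
df = pd.DataFrame(np.vstack([low, high]), columns=columns)

gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(df)

# Hypothetical new points to score against the fitted model
new_points = pd.DataFrame([[3200, 1600], [5000, 2500], [6900, 3400]], columns=columns)

# One row per point, one column per Gaussian component; each row sums to 1
proba = gmm.predict_proba(new_points)
print(pd.DataFrame(proba, columns=["Component 0", "Component 1"]).round(3))
```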

## Model Selection

Model selection is a way to choose the best-performing machine learning model for a specific task. Pandas is used to prepare and structure the data before testing different models. Techniques like train-test split and cross-validation help compare models based on accuracy or other metrics.
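As a rough sketch of how train-test split and cross-validation might be combined, the example below compares two classifiers on the iris dataset bundled with scikit-learn. The dataset, the two candidate models, and the variable names are assumptions chosen for illustration, not a recommended setup.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Features arrive as a Pandas DataFrame, the target as a Series
X, y = load_iris(return_X_y=True, as_frame=True)

# Hold out 25% of the rows for a final check
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

candidates = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
}

# Mean accuracy over 5-fold cross-validation, using the training rows only
cv_means = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(cv_means, key=cv_means.get)

# Refit the best candidate on all training rows and score it on the held-out rows
best_model = candidates[best_name].fit(X_train, y_train)
test_accuracy = best_model.score(X_test, y_test)

print(cv_means)
print(best_name, round(test_accuracy, 3))
```

Keeping the test rows out of the cross-validation step means the final accuracy is measured on data that played no part in choosing the model.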

