<a href="https://colab.research.google.com/github/Achiever-caleb/Machine_Learning_Tutorials/blob/main/Pandas_Numpy_Matplotlib_Plotly_and_Seaborn_for_Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas, Numpy, Matplotlib, Plotly and Seaborn for Machine Learning

Brief Overview

Pandas

- Purpose: Handles structured (tabular) data and provides tools for data manipulation and analysis.
- Key Features:
  - DataFrame and Series objects for data storage.
  - Built-in functions for aggregation, filtering, and cleaning.

NumPy

- Purpose: Provides support for numerical computation and multi-dimensional arrays.
- Key Features:
  - Efficient operations on large arrays.
  - Mathematical and statistical functions.

Plotly

- Purpose: Visualization library for creating dynamic, interactive, and publication-quality plots.

Matplotlib/Seaborn

- Purpose: Visualization libraries for creating static, interactive, and publication-quality plots.
- Key Features:
  - Matplotlib: Basic plots like bar, line, scatter.
  - Seaborn: High-level interface with advanced statistical plots.

## Pandas

- Pandas is a powerful Python library for data manipulation and analysis. It provides data structures like DataFrames and Series that make working with structured data easy and intuitive.

In [None]:
import pandas as pd
import numpy as np

### Series

- A Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). It's like a column in a table.

In [None]:
# Creating a Series
print("\n# Creating a Series")
data = [10, 20, 30, 40, 50]
s = pd.Series(data)
print("Series from a list:\n", s)

data = {'a': 10, 'b': 20, 'c': 30}
s = pd.Series(data)
print("\nSeries from a dictionary:\n", s)

In [None]:
# Accessing elements
print("\n# Accessing elements in a Series")
print("Element at position 0:", s.iloc[0]) #s has string labels here, so use .iloc for positional access
print("Element with label 'b':", s['b'])
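To make the difference between position-based and label-based access explicit, here is a minimal sketch. The Series and the variable names are illustrative assumptions and are not part of the cells above.

```python
import pandas as pd

# Hypothetical Series with string labels
s_demo = pd.Series({'a': 10, 'b': 20, 'c': 30})

first_by_position = s_demo.iloc[0]    # element at position 0
value_by_label = s_demo.loc['b']      # element with label 'b'
subset = s_demo.loc[['a', 'c']]       # several labels at once returns a Series

print("By position:", first_by_position)
print("By label:", value_by_label)
print("Subset:\n", subset)
```

Using .iloc and .loc explicitly avoids ambiguity about whether a value such as 0 is meant as a position or as a label.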

### DataFrame
- A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It's like a table or a spreadsheet.


In [None]:
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 30, 22, 28, 24],
        'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney'],
        'Salary': [60000, np.nan, 55000, 70000, 62000]}
df = pd.DataFrame(data)
print("DataFrame from a dictionary:\n", df)

In [None]:
data = [['Alice', 25, 'New York', 60000],
        ['Bob', 30, 'London', np.nan],
        ['Charlie', 22, 'Paris', 55000],
        ['David', 28, 'Tokyo', 70000],
        ['Eve', 24, 'Sydney', 62000]]
df2 = pd.DataFrame(data, columns=['Name', 'Age', 'City', 'Salary'])
print("\nDataFrame from a list of lists:\n", df2)

### Basic DataFrame Operations


- Accessing columns

In [None]:
print("\nAccessing a column:")
print(df['Name'])
print("\nAccessing multiple columns:")
print(df[['Name', 'Salary']])

- Accessing rows (using .loc and .iloc)

  - iloc
    - Integer-based indexing: iloc selects rows and columns by their integer position.
    - Zero-based indexing: The first row and column have position 0.
    - Exclusive slicing: As with Python lists, the end position of an iloc slice is not included.

  - loc
    - Label-based indexing: loc selects rows and columns by their labels.
    - Inclusive slicing: Unlike iloc, the end label of a loc slice is included.

In [None]:
print("\nAccessing rows:")
print("First row (using .iloc): \n", df.iloc[0]) #integer based indexing
print("Row with index 2 (using .loc): \n", df.loc[2]) #label based indexing. Since we did not specify an index when creating the dataframe, by default it is numerical, so in this case it is the same as iloc
print("Rows from index 1 to 3 (using .iloc): \n", df.iloc[1:4]) #iloc excludes the end position, so 1:4 returns positions 1, 2 and 3
print("Rows from index 1 to 3 (using .loc): \n", df.loc[1:3]) #loc is inclusive of the last element
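With the default numeric index, .loc and .iloc can look interchangeable. The sketch below uses a made-up frame with string row labels (an assumption for illustration) so the difference in slicing behaviour is visible.

```python
import pandas as pd

# Hypothetical frame with string row labels instead of the default 0..n-1
people = pd.DataFrame(
    {'Age': [25, 30, 22, 28]},
    index=['alice', 'bob', 'charlie', 'david'],
)

by_position = people.iloc[1:3]         # positions 1 and 2; position 3 is excluded
by_label = people.loc['bob':'david']   # labels 'bob' through 'david', end included

print("iloc[1:3]:\n", by_position)
print("loc['bob':'david']:\n", by_label)
```

The iloc slice returns two rows, while the loc slice returns three because the end label is part of the result.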

- Adding a new column

In [None]:

print("\nAdding a new column:")
df['Bonus'] = [5000, 0, 3000, 7000, 2000]
print(df)



- Deleting a column

In [None]:
print("\nDeleting a column:")
df_copy = df.copy() #create copy so original df is not modified
df_copy = df_copy.drop('Bonus', axis=1) #axis=1 specifies column
print(df_copy)

- Filtering Data

In [None]:
print("\n# Filtering Data")
print("People older than 25:\n", df[df['Age'] > 25])
print("\nPeople from New York:\n", df[df['City'] == "New York"])
print("\nPeople with salary greater than 60000 and from Tokyo:\n", df[(df['Salary'] > 60000) & (df['City'] == "Tokyo")])
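The filters above combine conditions with `&`. A short sketch of the other common combinators follows: `|` for "or", `~` for "not", and `isin()` for membership tests. The frame is a trimmed-down, illustrative copy of the one used above.

```python
import pandas as pd

# Illustrative data, similar to the DataFrame used above
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 22, 28, 24],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney'],
})

# OR: each condition needs its own parentheses
young_or_tokyo = people[(people['Age'] < 25) | (people['City'] == 'Tokyo')]

# Membership test against a list of values
in_europe = people[people['City'].isin(['London', 'Paris'])]

# NOT: invert a boolean mask with ~
not_europe = people[~people['City'].isin(['London', 'Paris'])]

print(young_or_tokyo)
print(in_europe)
print(not_europe)
```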


- Handling Missing Data

In [None]:
print("DataFrame with missing data:\n", df)

print("\nChecking for missing values:")
print(df.isnull())

print("\nNumber of missing values in each column:")
print(df.isnull().sum())

print("\nFilling missing salary with 0:")
df_filled = df.copy()
df_filled['Salary'] = df_filled['Salary'].fillna(0) #assign back; inplace fillna on a selected column is unreliable in recent pandas
print(df_filled)

print("\nFilling missing salary with the mean:")
df_filled2 = df.copy()
df_filled2['Salary'] = df_filled2['Salary'].fillna(df['Salary'].mean()) #mean() skips NaN by default
print(df_filled2)

print("\nDropping rows with missing values:")
df_dropped = df.copy()
df_dropped.dropna(inplace=True)
print(df_dropped)
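When several columns have gaps, it can be convenient to fill each one differently in a single call, or to drop rows only when a specific column is missing. This is a minimal sketch; the `staff` frame and the fill values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in two columns
staff = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [60000, np.nan, 55000],
    'City': ['New York', None, 'Paris'],
})

# A dict maps each column to its own fill value
filled = staff.fillna({'Salary': staff['Salary'].median(), 'City': 'Unknown'})

# Drop rows only when 'Salary' is missing
salary_known = staff.dropna(subset=['Salary'])

print(filled)
print(salary_known)
```

Both calls return new DataFrames and leave `staff` unchanged.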

- Basic Descriptive Statistics

In [None]:
print("\n# Basic Descriptive Statistics")
print(df.describe())



- Grouping

In [None]:
print("Mean salary by city")
print(df.groupby('City')['Salary'].mean())
print("Median salary by city")
print(df.groupby('City')['Salary'].median())
print("Max salary by city")
print(df.groupby('City')['Salary'].max())
print("Min salary by city")
print(df.groupby('City')['Salary'].min())
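In the DataFrame above every city appears only once, so each group holds a single row. The sketch below uses made-up data with repeated cities and computes several statistics in one call with `agg()`.

```python
import pandas as pd

# Hypothetical data with several rows per city
salaries = pd.DataFrame({
    'City': ['Paris', 'Paris', 'Tokyo', 'Tokyo', 'Tokyo'],
    'Salary': [50000, 60000, 70000, 80000, 90000],
})

# One row per city, one column per statistic
summary = salaries.groupby('City')['Salary'].agg(['mean', 'median', 'min', 'max'])
print(summary)
```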

- Sorting

In [None]:
print("Sorting by age")
print(df.sort_values(by='Age'))
print("\nSorting by age descending")
print(df.sort_values(by='Age', ascending=False))

## NumPy

- NumPy: Numerical Computing in Python
 - NumPy is the fundamental package for numerical computation in Python. It provides powerful tools for working with arrays and matrices.


### Creating NumPy Arrays
- NumPy arrays are homogeneous (all elements have the same data type) and are more efficient than Python lists for numerical operations.
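A quick way to see what "homogeneous" means in practice is to build arrays from mixed inputs and inspect the resulting dtype. The variable names below are illustrative.

```python
import numpy as np

ints = np.array([1, 2, 3])           # all integers: integer dtype
mixed = np.array([1, 2.5, 3])        # integers are upcast to float
as_text = np.array([1, 'two', 3])    # everything is converted to strings

print(ints.dtype, mixed.dtype, as_text.dtype)
```

NumPy picks one dtype that can represent every element, which is why mixing numbers and strings silently turns the numbers into text.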


In [None]:
# From a Python list
print("\n# From a Python list:")
python_list = [1, 2, 3, 4, 5]
np_array = np.array(python_list)
print(np_array)
print(type(np_array))

In [None]:
# Using np.arange()
print("\n# Using np.arange():")
array1 = np.arange(10)  # 0 to 9
print(array1)
array2 = np.arange(5, 15)  # 5 to 14
print(array2)
array3 = np.arange(0, 20, 2)  # 0 to 18, step of 2
print(array3)

In [None]:
# Using np.linspace()
print("\n# Using np.linspace():")
array4 = np.linspace(0, 1, 5)  # 5 evenly spaced numbers from 0 to 1 (inclusive)
print(array4)
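The main practical difference between the two functions is how the end value is treated: `np.arange()` takes a step and excludes the stop value, while `np.linspace()` takes a number of points and includes the stop value by default. A small sketch with illustrative names:

```python
import numpy as np

stepped = np.arange(0, 1, 0.25)                    # step of 0.25, stop excluded
spaced = np.linspace(0, 1, 5)                      # 5 points, stop included
spaced_open = np.linspace(0, 1, 4, endpoint=False) # 4 points, stop excluded

print(stepped)
print(spaced)
print(spaced_open)
```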

In [None]:
# Using np.zeros(), np.ones(), np.empty()
print("\n# Using np.zeros(), np.ones(), np.empty():")
zeros_array = np.zeros((3, 4))  # 3x4 array of zeros
print("Zeros array:\n",zeros_array)
ones_array = np.ones((2, 2))  # 2x2 array of ones
print("\nOnes array:\n", ones_array)
empty_array = np.empty((2, 3)) #creates an uninitialized array of given shape
print("\nEmpty array:\n", empty_array)

### Array Attributes

- NumPy arrays have useful attributes like shape, dtype, and ndim.

In [None]:
array = np.array([[1, 2, 3], [4, 5, 6]])
print("Array:\n", array)
print("Shape:", array.shape)  # (rows, columns)
print("Data type:", array.dtype)
print("Number of dimensions:", array.ndim)

### Array Indexing and Slicing
- Similar to Python lists, but with more powerful slicing capabilities.

In [None]:

array = np.array([10, 20, 30, 40, 50])
print("Array:", array)
print("Element at index 2:", array[2])
print("Slice from index 1 up to (not including) index 4:", array[1:4])
print("Slice from the beginning up to (not including) index 3:", array[:3])
print("Slice from index 2 to the end:", array[2:])

In [None]:
array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("\n2D array:\n", array_2d)
print("Element at row 1, column 2:", array_2d[1, 2]) #accessing element at row 1, column 2
print("Row 0:", array_2d[0, :]) #accessing row 0
print("Column 1:", array_2d[:, 1]) #accessing column 1
print("Subarray (top-left 2x2): \n", array_2d[:2, :2])
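Two of the "more powerful" indexing features are boolean masks and integer-array indexing. The sketch below also shows that a basic slice is a view onto the original data, while integer-array indexing returns a copy. Names are illustrative.

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

evens = grid[grid % 2 == 0]    # boolean mask: returns a 1D array of matches
picked_rows = grid[[0, 2]]     # integer-array indexing: rows 0 and 2, as a copy

top_left = grid[:2, :2]        # basic slice: a view, not a copy
top_left[0, 0] = 100           # this also changes grid[0, 0]

print("Evens:", evens)
print("Picked rows:\n", picked_rows)
print("Grid after editing the view:\n", grid)
```

Call `.copy()` on a slice when the original array must stay unchanged.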

### Array Operations

- NumPy allows element-wise operations on arrays.

In [None]:
array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])

print("Array 1:", array1)
print("Array 2:", array2)

print("Addition:", array1 + array2)
print("Subtraction:", array1 - array2)
print("Multiplication:", array1 * array2)
print("Division:", array1 / array2)
print("Scalar multiplication:", array1 * 3)
print("Exponentiation:", array1 ** 2)
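Element-wise operations also work between arrays of different but compatible shapes, a mechanism NumPy calls broadcasting. This is a minimal sketch with made-up arrays.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])                # shape (3,)
column = np.array([[100], [200]])           # shape (2, 1)

plus_row = matrix + row        # row is added to every row of matrix
plus_column = matrix + column  # column is added to every column of matrix

print(plus_row)
print(plus_column)
```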

### Universal Functions (ufuncs)

- NumPy provides many universal functions that operate element-wise on arrays.

In [None]:

array = np.array([1, 4, 9, 16])
print("Array:", array)
print("Square root:", np.sqrt(array))
print("Exponential:", np.exp(array))
print("Logarithm:", np.log(array))

### Aggregation Functions

- NumPy provides functions for performing aggregations like sum, mean, min, max, etc.

In [None]:

array = np.array([1, 2, 3, 4, 5])
print("Array:", array)
print("Sum:", np.sum(array))
print("Mean:", np.mean(array))
print("Min:", np.min(array))
print("Max:", np.max(array))
print("Standard deviation:", np.std(array))
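On multi-dimensional arrays, the `axis` argument controls the direction of the aggregation. The sketch below uses a small made-up 2D array.

```python
import numpy as np

scores = np.array([[1, 2, 3], [4, 5, 6]])

total = scores.sum()                 # over all elements
column_sums = scores.sum(axis=0)     # collapse rows: one value per column
row_means = scores.mean(axis=1)      # collapse columns: one value per row

print("Total:", total)
print("Column sums:", column_sums)
print("Row means:", row_means)
```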


### Reshaping Arrays


In [None]:

array = np.arange(12)
print("Original Array:", array)
reshaped_array = array.reshape(3, 4) #reshapes array to 3 rows and 4 columns
print("Reshaped Array:\n", reshaped_array)
flattened_array = reshaped_array.flatten() #flattens the array to 1 dimension
print("Flattened Array:", flattened_array)
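Two details are worth a short sketch: passing -1 lets NumPy infer one dimension, and `flatten()` always returns a copy whereas `ravel()` returns a view when it can. The names below are illustrative.

```python
import numpy as np

values = np.arange(12)
two_columns = values.reshape(-1, 2)   # -1 is inferred as 6, giving shape (6, 2)

flat_copy = two_columns.flatten()     # always a copy
flat_view = two_columns.ravel()       # a view here, because the data is contiguous

flat_copy[0] = 99    # does not affect values
flat_view[1] = 77    # changes values[1] as well

print("Shape:", two_columns.shape)
print("Original after edits:", values)
```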

## Matplotlib: Data Visualization in Python
- Matplotlib is a powerful library for creating static, interactive, and animated visualizations in Python. It provides a wide range of plotting functions for various data types.

In [None]:
import matplotlib.pyplot as plt

### Basic Plotting
- The simplest way to create a plot is using the plot() function.

In [None]:
# Sample data
x = np.linspace(0, 10, 100)  # Create 100 evenly spaced points between 0 and 10
y = np.sin(x)

# Code Example:
plt.plot(x, y)
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Sine Wave Plot")
plt.show()


### Customizing Plots
- Matplotlib provides extensive options for customizing plots, including line styles, colors, markers, labels, titles, and more.

In [None]:
# Code Example:
plt.plot(x, y, color='red', linestyle='--', marker='o', label='sin(x)') #added label for legend
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Customized Sine Wave Plot")
plt.legend() #show the legend
plt.show()

### Multiple Plots on the Same Axes

- You can plot multiple datasets on the same axes by calling the plot() function multiple times.

In [None]:
# Code Example:
y2 = np.cos(x)
plt.plot(x, y, label='sin(x)')
plt.plot(x, y2, label='cos(x)')
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Sine and Cosine Wave Plot")
plt.legend()
plt.show()


### Subplots

- Subplots allow you to create multiple plots within the same figure.

In [None]:
# Code Example:
fig, axes = plt.subplots(2, 1, figsize=(8, 6))  # 2 rows, 1 column of subplots, figsize sets the figure size
axes[0].plot(x, y)
axes[0].set_title("Sine Wave")
axes[1].plot(x, y2, color='orange')
axes[1].set_title("Cosine Wave")
plt.tight_layout() #adjusts subplot params so that subplots fit in to the figure area.
plt.show()
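For grids with more than one row and column, `axes` is a 2D array. A common pattern, sketched below under the assumption of four made-up curves, is to loop over `axes.flat` and share the x-axis between panels.

```python
import numpy as np
import matplotlib.pyplot as plt

x_values = np.linspace(0, 10, 100)

# Hypothetical set of curves, one per panel
curves = {
    'sin(x)': np.sin(x_values),
    'cos(x)': np.cos(x_values),
    'sin(2x)': np.sin(2 * x_values),
    'cos(2x)': np.cos(2 * x_values),
}

fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True)  # 2x2 grid, shared x-axis
for ax, (name, y_values) in zip(axes.flat, curves.items()):
    ax.plot(x_values, y_values)
    ax.set_title(name)
fig.tight_layout()
```

In a notebook the figure is displayed at the end of the cell; in a script, call `plt.show()` afterwards.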

### Scatter Plots

- Scatter plots are used to visualize the relationship between two variables.


In [None]:
# Sample Data
x = np.random.rand(50)
y = np.random.rand(50)
colors = np.random.rand(50)
sizes = (100 * np.random.rand(50))**2

# Code Example:
plt.scatter(x, y, c=colors, s=sizes, alpha=0.5, cmap='viridis') #c is for color, s is for size, alpha is for transparency, cmap is for colormap
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Scatter Plot")
plt.colorbar() #show the colorbar
plt.show()

### Bar Charts

- Bar charts are used to compare categorical data.

In [None]:
# Sample data
categories = ['A', 'B', 'C', 'D']
values = [25, 15, 30, 10]

# Code Example:
plt.bar(categories, values)
plt.xlabel("Categories")
plt.ylabel("Values")
plt.title("Bar Chart")
plt.show()

### Histograms

- Histograms are used to visualize the distribution of a single numerical variable.

In [None]:
# Sample data
data = np.random.normal(0, 1, 1000)  # 1000 random numbers from a normal distribution

# Code Example:
plt.hist(data, bins=30, edgecolor='black') #bins adjusts how many bars are shown
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram")
plt.show()
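To see the numbers behind a histogram, `np.histogram()` computes the same kind of bin counts without drawing anything. This sketch uses a seeded generator (an assumption, for repeatability) rather than the data from the cell above.

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded so the sample is repeatable
samples = rng.normal(0, 1, 1000)

counts, edges = np.histogram(samples, bins=30)

print("Number of bins:", len(counts))
print("Number of bin edges:", len(edges))   # always one more than the number of bins
print("Total count:", counts.sum())
```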

### Pie Charts

- Pie charts are used to show the proportions of different categories in a whole.

In [None]:

# Sample data
labels = ['Frogs', 'Hogs', 'Dogs', 'Logs']
sizes = [15, 30, 45, 10]
explode = (0, 0.1, 0, 0)  # only "explode" the 2nd slice (i.e. 'Hogs')

# Code Example:
plt.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90) #autopct shows the percentage, shadow adds a shadow effect, startangle starts the first slice at 90 degrees
plt.title("Pie Chart")
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

## Seaborn: Statistical Data Visualization
- Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for creating attractive and informative statistical graphics.


In [None]:
import seaborn as sns

In [None]:
# Sample Data (using a Pandas DataFrame)
np.random.seed(42) #for reproducibility
data = {'Category': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'B', 'C', 'A'],
        'Value': np.random.rand(10),
        'Group': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'Y', 'X', 'Y', 'X']}
df = pd.DataFrame(data)

### Scatter Plots

- Seaborn's scatterplot() is a higher-level counterpart to Matplotlib's scatter(). It works directly with DataFrame columns and can map grouping variables to color, marker style, and size.


In [None]:

# Code Example:
sns.scatterplot(x='Category', y='Value', data=df, hue='Group', style='Group', s=100) #hue and style add different colors and markers for different groups
plt.title("Scatter Plot with Seaborn")
plt.show()

### Line Plots

- Seaborn's lineplot() is used to visualize trends over continuous data.

In [None]:
# Sample Data
x = np.linspace(0, 10, 100)
y = np.sin(x)
df_line = pd.DataFrame({'x': x, 'y': y})

# Code Example:
sns.lineplot(x='x', y='y', data=df_line)
plt.title("Line Plot with Seaborn")
plt.show()

### Bar Plots

- Seaborn's barplot() is used to compare aggregated values (like means) across different categories.

In [None]:

# Code Example:
sns.barplot(x='Category', y='Value', data=df)
plt.title("Bar Plot with Seaborn")
plt.show()
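As a rough check of what barplot() draws, the bar heights under its default estimator (the mean) can be reproduced with a pandas groupby. The frame and its values below are made up for illustration.

```python
import pandas as pd

# Hypothetical data with several values per category
measurements = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'C'],
    'Value': [1.0, 3.0, 2.0, 4.0, 5.0],
})

# One mean per category: these correspond to the bar heights
bar_heights = measurements.groupby('Category')['Value'].mean()
print(bar_heights)
```

The error bars that barplot() adds by default are a separate estimate of uncertainty around each mean and are not computed here.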

### Count Plots

- Seaborn's countplot() is used to show the counts of observations in each categorical bin.


In [None]:

# Code Example:
sns.countplot(x='Category', data=df)
plt.title("Count Plot with Seaborn")
plt.show()

### Histograms (Distributions)

- Seaborn's histplot() (axes-level) or displot() (figure-level, with built-in faceting) is used to visualize the distribution of a single numerical variable.


In [None]:

# Code Example:
sns.histplot(x='Value', data=df, kde=True) #kde adds a kernel density estimate line
plt.title("Histogram with Seaborn")
plt.show()


### Box Plots

- Seaborn's boxplot() is used to show the distribution of a numerical variable across different categories.

In [None]:

# Code Example:
sns.boxplot(x='Category', y='Value', data=df, hue='Group')
plt.title("Box Plot with Seaborn")
plt.show()
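The elements of a box plot can be computed by hand, which helps when reading one. The sketch below assumes the common 1.5 × IQR rule for whiskers (the default in Seaborn and Matplotlib) and uses a made-up series with one extreme value.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1, median, q3 = values.quantile([0.25, 0.5, 0.75])   # box edges and center line
iqr = q3 - q1                                          # interquartile range

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are drawn as individual outlier markers
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Q1, median, Q3:", q1, median, q3)
print("Outliers:", outliers.tolist())
```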


### Violin Plots

- Seaborn's violinplot() is similar to a box plot but provides a more detailed view of the distribution.

In [None]:

# Code Example:
sns.violinplot(x='Category', y='Value', data=df, hue='Group', split=True) #split combines the two violin plots into one when hue is used.
plt.title("Violin Plot with Seaborn")
plt.show()

### Heatmaps

- Heatmaps are used to visualize matrix data or correlations between variables.

In [None]:
# Sample data (correlation matrix)
# Note: df has a single numeric column here, so this is a 1x1 matrix containing 1.0
correlation_matrix = df[['Value']].corr()

# Code Example:
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm') #annot shows the correlation values on the heatmap, cmap changes the colormap
plt.title("Heatmap with Seaborn")
plt.show()
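A correlation heatmap is more informative with several numeric columns. The sketch below builds such a matrix from made-up columns with known relationships; the result could be passed to `sns.heatmap()` in the same way as above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seeded so the sample is repeatable
base = rng.normal(size=200)

# Hypothetical columns with known relationships to 'base'
frame = pd.DataFrame({
    'base': base,
    'doubled': 2 * base,              # perfectly positively correlated
    'negated': -base,                 # perfectly negatively correlated
    'noise': rng.normal(size=200),    # independent of base
})

corr = frame.corr()   # 4x4 matrix of pairwise Pearson correlations
print(corr.round(2))
```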

### Pair Plots

- Seaborn's pairplot() creates a matrix of scatter plots showing the relationships between all pairs of variables in a DataFrame.


In [None]:

# Sample Data (adding a new column to make it more interesting)
df['Value2'] = np.random.rand(10)
# Code Example:
g = sns.pairplot(df, hue='Category')
g.figure.suptitle("Pair Plot with Seaborn", y=1.02) #pairplot returns a grid of axes, so set a figure-level title
plt.show()

## Plotly: Interactive Data Visualization

- Plotly is a modern data visualization library that allows you to create interactive plots and dashboards. It's particularly well-suited for web applications and presentations.


In [None]:
import plotly.express as px
import plotly.graph_objects as go


In [None]:
# Sample Data (using a Pandas DataFrame)
np.random.seed(42)
data = {'Category': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'B', 'C', 'A'],
        'Value': np.random.rand(10),
        'Group': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'Y', 'X', 'Y', 'X']}
df = pd.DataFrame(data)

### Scatter Plots (plotly.express)

- plotly.express (px) provides a high-level interface for creating common plot types with minimal code.


In [None]:

# Code Example:
fig = px.scatter(df, x='Category', y='Value', color='Group', symbol='Group', title="Scatter Plot with Plotly Express") #symbol adds different marker shapes
fig.show()

### Scatter Plots (plotly.graph_objects)

- plotly.graph_objects (go) provides more control over the plot's appearance and layout.

In [None]:

fig = go.Figure(data=go.Scatter(x=df['Category'], y=df['Value'], mode='markers', marker=dict(color=df['Value'], size=10)))
fig.update_layout(title="Scatter Plot with Plotly Graph Objects")
fig.show()



### Line Plots (plotly.express)


In [None]:

x = np.linspace(0, 10, 100)
y = np.sin(x)
df_line = pd.DataFrame({'x': x, 'y': y})

fig = px.line(df_line, x='x', y='y', title='Line Plot with Plotly Express')
fig.show()

### Bar Charts (plotly.express)

In [None]:

fig = px.bar(df, x='Category', y='Value', color='Group', title="Bar Chart with Plotly Express")
fig.show()

### Histograms (plotly.express)


In [None]:

fig = px.histogram(df, x='Value', color='Category', title="Histogram with Plotly Express") #color adds different colors for the different categories
fig.show()

### Box Plots (plotly.express)


In [None]:

fig = px.box(df, x='Category', y='Value', color='Group', title="Box Plot with Plotly Express")
fig.show()

### Violin Plots (plotly.express)


In [None]:


fig = px.violin(df, x='Category', y='Value', color='Group', box=True, points="all", title="Violin Plot with Plotly Express") #box shows the boxplot inside the violinplot, points shows the actual data points
fig.show()

### Heatmaps (plotly.graph_objects)


In [None]:

correlation_matrix = df[['Value']].corr() #df has a single numeric column here, so this is a 1x1 matrix
fig = go.Figure(data=go.Heatmap(z=correlation_matrix.values, x=correlation_matrix.columns, y=correlation_matrix.index, colorscale='Viridis', zmin=-1, zmax=1)) #x labels the columns of z and y labels its rows; zmin and zmax keep the color scale consistent across different correlation matrices.
fig.update_layout(title="Heatmap with Plotly Graph Objects")
fig.show()

### 3D Scatter Plots (plotly.express)


In [None]:

#Sample Data
df['Value2'] = np.random.rand(10)
fig = px.scatter_3d(df, x='Category', y='Value', z='Value2', color='Group', title="3D Scatter Plot with Plotly Express")
fig.show()

### Subplots (plotly.subplots)


In [None]:

from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=2, subplot_titles=('Scatter Plot', 'Bar Chart'))

fig.add_trace(go.Scatter(x=df['Category'], y=df['Value'], mode='markers'), row=1, col=1)
fig.add_trace(go.Bar(x=df['Category'], y=df['Value']), row=1, col=2)

fig.update_layout(title_text="Subplots with Plotly")
fig.show()

© Caleb Okon 2024

Machine Learning Tutorials