# Title

### Goals
- Define achievable goals here

### Data Sources
- Document data sources with [links]()

## Setup and Library Imports

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
import seaborn as sns
import matplotlib
from matplotlib import pyplot as plt
%matplotlib inline
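# Uncomment the imports needed for the scaling examples below
# from sklearn.compose import TransformedTargetRegressor
# from sklearn.linear_model import LinearRegression
# from sklearn.preprocessing import MinMaxScaler
# from sklearn.preprocessing import RobustScaler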


In [None]:
# Display entire dataframe
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

In [None]:
# Set matplotlib theme to match seaborn
sns.set()

In [None]:
# cd 'Correct file path'

### Load Data

In [None]:
df = pd.read_csv('path') #, index_col=0)
df.head()

### Data Pre-Processing
- Ensure columns are correctly named and formatted
- Check data types and null values

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
# # Common pre-processing techniques
# # Most of these return a new object, so assign the result back
# df.columns = df.columns.str.lower()
# df = df.rename(columns={'old_col': 'new_col'})
# df['col'] = pd.to_numeric(df['col'], downcast='integer')
# df = df.astype({'col1': 'int32'})
# df = df.replace(r'\$', '', regex=True)
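The techniques above can be tried on a small, made-up DataFrame before applying them to real data. This is a minimal sketch: the column names and values (`Item Name`, `Price`, `Qty`) are hypothetical and only illustrate the pattern of assigning each result back.

```python
import pandas as pd

# Hypothetical raw data: inconsistent column names, numbers stored as strings
raw = pd.DataFrame({
    "Item Name": ["widget", "gadget", "doohickey"],
    "Price": ["$10", "$25", "$7"],
    "Qty": ["3", "1", "12"],
})

clean = raw.copy()
# Normalize column names: lowercase, underscores instead of spaces
clean.columns = clean.columns.str.lower().str.replace(" ", "_", regex=False)
# rename returns a new DataFrame, so assign the result back
clean = clean.rename(columns={"qty": "quantity"})
# Strip the dollar sign, then convert the strings to numbers
clean["price"] = pd.to_numeric(clean["price"].str.replace("$", "", regex=False))
clean["quantity"] = clean["quantity"].astype("int32")

print(clean.dtypes)
```

Running `clean.info()` afterwards is a quick way to confirm that every column ended up with the intended type.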

In [None]:
# Export pre-processed data
# df.to_csv('path')

### Data Scaling

#### Why scale data?
Many models are sensitive to the scale of their inputs. Putting features on comparable scales helps your model identify patterns and helps you avoid errors that prevent it from converging. This is especially important for regression tasks, where you should scale the target variable as well!

#### Classification / Non-Regression Tasks
Use a scaler object for the numerical predictors <br>
General Guidelines:
- StandardScaler for a variable that follows a relatively normal distribution
- RobustScaler for a variable whose median is quite different from its mean
- MinMaxScaler to scale between 0 and 1
<br>
More information in this [article](https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02)
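To make the guidelines above concrete, here is a small sketch that applies all three scalers to the same made-up feature. The `values` array is hypothetical and contains one large outlier, which is the situation where the median and mean diverge.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical feature with one large outlier (scalers expect a 2D array)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(values)  # zero mean, unit variance
robust = RobustScaler().fit_transform(values)      # centered on the median, scaled by the IQR
minmax = MinMaxScaler().fit_transform(values)      # rescaled to the range [0, 1]

for name, scaled in [("standard", standard), ("robust", robust), ("minmax", minmax)]:
    print(name, np.round(scaled.ravel(), 2))
```

Comparing the printed rows shows how each scaler treats the outlier: MinMaxScaler compresses the other values toward 0, while RobustScaler keeps them spread around the median.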

In [None]:
# Classification Example with MinMaxScaler
# scaled_data = MinMaxScaler().fit_transform(unscaled_data)
# scaled_df = DataFrame(scaled_data, columns=unscaled_col_names_list)
# scaled_df.head()

#### Regression Tasks
Things get complicated if you scale your data in one notebook and run your model in another: the fitted scaler object won't be available to inverse-transform your predictions. The cell below wraps the regressor so that predictions come back inverse-scaled, in the target's original units. <br>
**Example:** If I'm predicting a dollar amount with Robust-Scaled data, my scaled model might return 0.17. I need to convert that back to dollars, but my scaler object is in a different notebook. <br>
**TransformedTargetRegressor** can help us with that! See the [documentation](https://scikit-learn.org/stable/modules/compose.html#transforming-target-in-regression) for details.

In [None]:
# Regression Example with LinearRegression() and RobustScaler()
# reg = TransformedTargetRegressor(regressor=LinearRegression(),
#                                  transformer=RobustScaler())
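As a rough illustration of the regression example, the sketch below fits the wrapped regressor on made-up data where the target is a dollar amount. The arrays `X` and `y` are hypothetical; the point is only that `predict` returns values in the target's original units, with no separate scaler object to keep track of.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import RobustScaler

# Hypothetical data: the target is in dollars and exactly linear in X
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 1000.0 * X.ravel() + 5000.0

# The target is scaled before fitting and inverse-scaled after predicting
reg = TransformedTargetRegressor(regressor=LinearRegression(),
                                 transformer=RobustScaler())
reg.fit(X, y)

pred = reg.predict(np.array([[10.0]]))
print(pred)  # a dollar amount, not a scaled value
```

Because the scaler lives inside `reg`, saving or passing around that one object is enough to get unscaled predictions elsewhere.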

### Data Visualization
- Histograms/Violin Plots for numerical distributions
- Barplots for categorical distributions
- Pairplots/scatterplots for related variables
- Hues for categorical separation

In [None]:
# Separate columns into subsets
numerical = []
categorical = []
target = []
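If typing the column lists by hand is tedious, one possible shortcut is to derive them from the dtypes with `select_dtypes`. This is a sketch on a made-up DataFrame: `example_df` and its column names, including the `churned` target, are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data standing in for df
example_df = pd.DataFrame({
    "age": [23, 35, 41],
    "income": [40000.0, 52000.0, 61000.0],
    "city": ["Austin", "Boston", "Chicago"],
    "churned": [0, 1, 0],
})

target = ["churned"]  # assumed target column
features = example_df.drop(columns=target)

# Split the remaining columns by dtype
numerical = features.select_dtypes(include="number").columns.tolist()
categorical = features.select_dtypes(exclude="number").columns.tolist()

print(numerical, categorical)
```

Numerically encoded categories (for example, integer class codes) will land in `numerical`, so the resulting lists are worth a manual check.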

In [None]:
# Plot ideas (uncomment and replace the placeholder column names with columns from df)
# sns.histplot(data=df, x='feature')
# sns.pairplot(df[numerical + ['category']], kind='reg', hue='category')
# sns.relplot(data=df, x='valence', y='loudness', hue='mode')
# sns.barplot(data=df, x='class', y='percent', hue='mode')
# sns.catplot(data=df, x='class', y='col', kind='violin')

### Summary Data

In [None]:
# Final check for null values
df.isnull().sum().sum()

### Correlational Analysis

In [None]:
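# Restrict to numeric columns; recent pandas versions raise an error on non-numeric ones
correlations = df.select_dtypes(include='number').corr()
correlations[abs(correlations) >= 0.5]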
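The filtered matrix above still shows every pair twice, plus the diagonal of 1s. One possible alternative, sketched here on a made-up DataFrame, is to keep only the upper triangle and list the strong pairs directly. `example_df` and its columns are hypothetical, and the 0.5 threshold simply mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for df
example_df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # perfectly correlated with a
    "c": [2, 5, 1, 4, 3],    # only weakly correlated with a and b
    "label": list("xyzxy"),  # non-numeric, excluded below
})

corr = example_df.select_dtypes(include="number").corr()

# Keep the upper triangle (k=1 skips the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Filter by absolute correlation and sort the strongest pairs first
strong = pairs[pairs.abs() >= 0.5].sort_values(key=lambda s: s.abs(), ascending=False)
print(strong)
```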

In [None]:
df.describe()

### Conclusion
- Did we achieve goals?
- What insights did we discover?
- What are the next steps?