# Structured Data Prep Notebook
## Data Source: ____
This workbook uses a series of standard and configurable Python code recipes to take a **structured** data source and prepare it for machine learning/data mining. <br>
This workbook assumes some problem qualification and Exploratory Data Analysis on the data has already taken place. <br>
The goal of this workbook is to provide a process and a starting point, so an analyst can spend less time prepping data, and more time fitting models! <br>

**To do:**
- Consider executing categorical feature encoding before handling null variables
- HIGHLIGHT WHERE DATA ENTRY IS NEEDED (RED)
- HIGHLIGHT CANDIDATE CELLS FOR PIPELINE GEN (#pipeline)
<br>

## Contents
<a id='Section0'></a>
[1. Setup: Goal for data prep](#Section1) <br>
[2. Setup: Library and Data Source Import](#Section2) <br>
[3. Setup: Identifying Cleaning Challenges](#Section3) <br>
[4. Setup: Tidying and Deduplication](#Section4) <br>
[5. Cleaning: Missing Value Management](#Section5) <br>
[6. Cleaning: Numerical Features](#Section6) <br>
[7. Cleaning: Boolean Features](#Section7) <br>
[8. Cleaning: Date Features](#Section8) <br>
[9. Cleaning: Categorical Features](#Section9) <br>
[10. Cleaning: Other Features](#Section10) <br>
[11. Evaluation of Prepared Data](#Section11) <br>
[12. Construction of a Prep Pipeline](#Section12) <br>
[13. References](#Section13) <br>


### A note on Data Prep Processes
Many experts recommend following a routine or process when prepping data, as it saves time and encourages best practice. <br>
This workbook follows the process below, which is further [documented here](http://bit.ly/2HOA18w). <br>
<img align="left" width="500" height="1000" src="images/Data Cleaning Process V1_Shared - Process Draft One.png">

## 1. Setup: Goal for data prep
<a id='Section1'></a>
[Go back to contents](#Section0) <br>
### State here the goal of your data preparation

Your goal goes here. <br>
Quite a few decisions will be informed by this goal in this notebook - so it's important to have one, and to state it up front.

## 2. Setup: Library and Data Source Import
<a id='Section2'></a>
[Go back to contents](#Section0) <br>
This section installs and imports some useful libraries for data prep, and provides some options on connecting with your data source. <br>
**Note - if your data is not structured, you'll need to structure it first**

In [None]:
# # Installing libraries if needed
# import sys
# # This workbook uses the Pandas Profiling library - you can install it on your local system using this code
# !{sys.executable} -m pip install pandas_profiling
# # This workbook may also use the fuzzywuzzy string matching library, which can be installed as follows
# !{sys.executable} -m pip install fuzzywuzzy
# ! {sys.executable} -m pip install python-Levenshtein
# # This workbook's examples use the geopy library
# !{sys.executable} -m pip install geopy

In [None]:
%%time
# This workbook uses the following libraries
import sys
import numpy as np
import pandas as pd
pd.set_option('display.max_rows',100)
pd.set_option('display.max_columns', 200) # Good for wide datasets - otherwise it will truncate the data in views like head
import pandas_profiling # for exploration of datasets

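# for missing value processing
try:
    from sklearn.preprocessing import Imputer  # scikit-learn < 0.22
except ImportError:
    # Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it (it has no `axis` argument)
    from sklearn.impute import SimpleImputer as Imputer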

# for numeric processing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from mlxtend.preprocessing import minmax_scaling
# for Box-Cox Transformation
from scipy import stats

# for text processing
import re 
import string
import fuzzywuzzy
from fuzzywuzzy import process

# Setting the seed for reproducibility
np.random.seed(42)

In [None]:
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt

In [None]:
# importing data
filepath_or_url = ""
data_raw = pd.read_csv(filepath_or_url,low_memory=False) # import CSV
# data_raw = pd.read_excel(filepath_or_url,sheet_name="") # import XLSX

In [None]:
# validating import
data_raw

#### Import references:
- [Pandas Input/Output Functions](https://pandas.pydata.org/pandas-docs/stable/api.html#input-output)

## 3. Setup: Identifying Cleaning Challenges
<a id='Section3'></a>
[Go back to contents](#Section0) <br>
This section includes:
- Understanding default data types and columns
- Understanding nulls
- Using the Pandas Profiler to deep dive into variables
- Evaluating some of the challenges
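
The first two checks above can also be combined into one compact table. The following is a minimal sketch on a small made-up frame; `toy` and `summarise_columns` are hypothetical names used for illustration only and are not part of this workbook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for data_raw
toy = pd.DataFrame({
    "height": [1.5, np.nan, 3.0, 4.5],
    "species": ["oak", "elm", None, "oak"],
    "id": [1, 2, 3, 4],
})

def summarise_columns(df):
    """Return dtype, % missing and number of distinct values per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "pct_missing": df.isnull().mean() * 100,
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })

summary = summarise_columns(toy)
print(summary)
```

Sorting this table by `pct_missing` or `n_unique` is one quick way to shortlist columns worth a closer look in the profile report.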

In [None]:
# Evaluate column names, types, nulls, using info.
data_raw.info()

In [None]:
# Evaluate numerical top line items using describe
data_raw.describe(include=[np.number]).T

In [None]:
# Evaluate object and categorical items using describe - are they high in cardinality?
# (np.object was removed in NumPy 1.24, so the dtypes are given as strings)
data_raw.describe(include=['object','category']).T

In [None]:
# Create a "pandas profile report" to enable efficient deep dives into all features.
report = pandas_profiling.ProfileReport(data_raw)
report

In [None]:
# You can export the profile report if needed
report.to_file(outputfile="raw_data_profile.html")

In [None]:
# You can also see what variables the report recommends excluding based upon a correlation with other variables >0.9
report.get_rejected_variables()

In [None]:
# Finally, matplotlib and Boolean indexing can help deep dive on fields with strange distributions
# data_raw[''].dropna().hist(bins=100)

**Notes:** <br>
*Your notes here*

## 4. Setup: Tidying and Deduplication
<a id='Section4'></a>
[Go back to contents](#Section0) <br>
This section involves:
- Standardizing column names.
- Evaluating whether each row is a sample, and what transforms are needed to bring this about.
- Testing for duplicates, and Identifying an index for each sample (where possible)
- Sorting columns into the following **data types**: numerical, boolean, date, categorical, and other.
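
As a rough illustration of the first three steps, here is a sketch on made-up data. The frame `toy`, the helper `to_snake_case`, and the column names are assumptions for this example; unlike the cell further down, the helper also strips leading and trailing underscores.

```python
import re
import pandas as pd

# Hypothetical toy data with messy column names and one duplicated sample
toy = pd.DataFrame(
    [[1, "oak", 2.0], [2, "elm", 3.5], [2, "elm", 3.5], [3, "ash", 1.0]],
    columns=["Tree ID", "Common-Name", "Height (m)"],
)

def to_snake_case(name):
    """Lower-case a column name and collapse non-word runs into one underscore."""
    return re.sub(r"\W+", "_", name.strip().lower()).strip("_")

toy.columns = [to_snake_case(c) for c in toy.columns]

# Show every row that shares a candidate index value (keep=False shows all copies)
dupes = toy[toy.duplicated(subset="tree_id", keep=False)]

# Keep the first copy of each sample, then index on the candidate column
deduped = toy.drop_duplicates(subset="tree_id", keep="first").set_index("tree_id")
print(deduped)
```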

In [None]:
# As per Chris Albon, it is best practice to treat a data frame as immutable, and to copy before manipulation (to protect against mistakes)

# Copying the raw data file:
data_prep_1 = data_raw.copy()

# Standardizing column names to snake case:
data_prep_1.columns = [c.replace(' ', '_') for c in data_prep_1.columns]
data_prep_1.columns =  [c.lower() for c in data_prep_1.columns]
data_prep_1.columns = [re.sub(r'\W+','_',c) for c in data_prep_1.columns]

# Reporting the resulting columns as a list for later reference
data_prep_1.columns.tolist()

In [None]:
# Evaluating the shape of the dataframe, and whether each row is a sample and each column is a variable
# A random sample is thought to be a good way to do this.
data_prep_1.sample(20)

**Notes on tidiness evaluation:** <br>
*Your notes here*

In [None]:
# testing for duplicates - first across all features
data_prep_1[data_prep_1.duplicated()]

In [None]:
# testing for duplicates - then across index candidate
index = ""
data_prep_1[data_prep_1.duplicated(index)]

In [None]:
# Removing a duplicate from the index
data_prep_2 = data_prep_1.drop_duplicates(subset=index, keep='first')

# Optional if the previous step is not applicable
# data_prep_2 = data_prep_1.copy()

In [None]:
# Reindexing on the new variable
data_prep_3 = data_prep_2.set_index(index) # note this moves the column into the index, so it no longer appears as a regular column

In [None]:
# Creating a list of irrelevant columns based on the profile report and other observations
manual_exclude_list = []
cols_exclude_total = manual_exclude_list
cols_exclude_total

**Justification for manually excluded columns:** <br>
*Your notes here*

In [None]:
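# Example: inspect the distinct values of a single column
# ('located_in' comes from the example dataset - substitute a column of your own)
data_prep_3['located_in'].unique()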

In [None]:
# Sorting columns - defining the logic for the sort:
def feature_sort(cols_num,cols_bool,cols_date,cols_cat,cols_other,cols_exclude_total):
    for col in data_prep_3.columns:
        if col not in cols_exclude_total + cols_num + cols_bool + cols_date + cols_cat + cols_other:
            if col in data_prep_3.columns[(data_prep_3.dtypes == np.float64) | (data_prep_3.dtypes == np.float32)]:
                cols_num.append(col)
            elif (data_prep_3[col].nunique() == 2) or ("true" in data_prep_3[col].unique()) or ("false" in data_prep_3[col].unique()) \
            or ("yes" in data_prep_3[col].unique()) or ("no" in data_prep_3[col].unique()):
                cols_bool.append(col)
            elif 'date' in str(col):
                cols_date.append(col)
            elif data_prep_3[col].nunique() < data_prep_3.shape[0]/100: # Arbitrary limit
                cols_cat.append(col)
            else:
                cols_other.append(col)
    return cols_num,cols_bool,cols_date,cols_cat,cols_other

In [None]:
# Exceptions can be handled by placing their column names in the relevant list below before running the sort
cols_num = []
cols_bool = []
cols_date = []
cols_cat = []
cols_other = []

In [None]:
# Running the sort
cols_num,cols_bool,cols_date,cols_cat,cols_other = feature_sort(cols_num,cols_bool,cols_date,cols_cat,cols_other,cols_exclude_total)

In [None]:
# Evaluate the results of the sort
[print(key+" features:",value,sep='\n') for key,value in {"cols_num":cols_num,"cols_bool":cols_bool,"cols_date":cols_date,\
                                          "cols_cat":cols_cat,"cols_other":cols_other}.items()]

In [None]:
# Removing Irrelevant features
data_prep_4 = data_prep_3.drop(labels=cols_exclude_total,axis = 1)
data_prep_4.columns

#### References for section 4: 
##### Pandas Data Wrangling Functions
[Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) <br>
*Credit - Irv Lustig, Princeton Consultants*
##### Tidy Data
**Image:** Definition of variables, observations, and values in Tidy Data <br>
*Image Credit - R for Data Science (Hadley Wickham & Garrett Grolemund)* <br>
<img align="left" width="500" height="500" src="images/tidy data.png">

## 5. Cleaning: Missing Value Management
<a id='Section5'></a>
[Go back to contents](#Section0) <br>
This section covers missing value management, and helps make decisions on whether to drop values, impute values, or handle them otherwise.
- Ensure we have detected all nans
- Evaluate how many nans we have
- Evaluate how important a feature is
- Evaluate why the values are NaN - were they not recorded, or do they not exist?
- Flag values for imputation during analysis
- Drop values for the unimportant columns
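
A small sketch of the detect, count, and flag steps on made-up data is given here. The frame `toy`, the sentinel strings, and the `_was_missing` suffix are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two "nulls" are hiding behind text values
toy = pd.DataFrame({
    "age_description": ["mature", "unknown", "juvenile", "n/a"],
    "diameter": [30.0, np.nan, 12.0, 55.0],
})

# Per-column values that really mean "missing" (assumed for this toy data)
sentinels = {"age_description": {"unknown": np.nan, "n/a": np.nan}}
cleaned = toy.replace(to_replace=sentinels)

# Share of missing values per column, largest first
pct_missing = cleaned.isnull().mean().mul(100).sort_values(ascending=False)
print(pct_missing)

# Flag rows before any imputation so the "was missing" signal is not lost
cleaned["diameter_was_missing"] = cleaned["diameter"].isnull().astype(int)
```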

In [None]:
# Check category features for "nulls" hiding behind other values (a common gotcha!) 
[print(str(c)+' value counts'\
       ,data_prep_4[c].value_counts()\
       ,sep="\n") for c in cols_cat]

In [None]:
# Build a replacement dict of column names, and values, to replace with NaN's
# (placeholders - each column name may appear only once as a key)
replace_dict = {"column_1":{"value":np.nan},"column_2":{"value":np.nan}}

# Replace "unknown" values with nans
data_prep_5 = data_prep_4.replace(to_replace=replace_dict)

# Optional if the previous step is not applicable
# data_prep_5 = data_prep_4.copy()

In [None]:
# List columns with missing values by %
data_prep_5.isnull().sum()\
    .apply(lambda x: (x/data_prep_5.shape[0])*100)\
    .sort_values(ascending=False)

In [None]:
# Assess impact of dropping all samples with missing values
print("rows before drop: " + str(data_prep_5.shape[0])\
      ,"rows after drop: " + str(data_prep_5.dropna().shape[0])\
      ,sep="\n")

In [None]:
# Are any features with missing values unimportant? If so, note them down and drop them
cols_missing = [] # note them in this list
cols_exclude_total = cols_exclude_total + cols_missing # list.append returns None, so concatenate instead
data_prep_6 = data_prep_5.drop(labels=cols_missing,axis=1)

In [None]:
# Are the values for the sample never recorded, or do they not exist?
# This can be determined by reading the docs or through EDA
not_recorded = []
dont_exist = []

In [None]:
# Leave not_recorded imputation for later (as only applies to numeric features, and can be impacted horrendously by outliers).

In [None]:
# Drop samples with dont_exist now
print(data_prep_6.shape[0])
data_prep_7 = data_prep_6.dropna(subset=dont_exist)
print(data_prep_7.shape[0])

In [None]:
# Re-sort cols to adjust to anything dropped in this section:
# keep only the names that still exist as columns (feature_sort only ever adds names)
cols_num,cols_bool,cols_date,cols_cat,cols_other = [[c for c in cols if c in data_prep_7.columns]\
    for cols in (cols_num,cols_bool,cols_date,cols_cat,cols_other)]
# Evaluate the results of the sort
[print(key+" features:",value,sep='\n') for key,value in {"cols_num":cols_num,"cols_bool":cols_bool,"cols_date":cols_date,\
                                          "cols_cat":cols_cat,"cols_other":cols_other}.items()]

In [None]:
# Note that there are other approaches that can be effective,
# like fillna (e.g. replace all NaNs with zeros)

#### References for section 5
[Imputing Missing Values with Means](https://chrisalbon.com/machine_learning/preprocessing_structured_data/impute_missing_values_with_means/) <br>
[Handling Missing Values by Dan B](https://www.kaggle.com/dansbecker/handling-missing-values)

## 6. Cleaning: Numerical Features
<a id='Section6'></a>
[Go back to contents](#Section0) <br>
This section involves <br>
- Outlier Management
- Feature Transformation (e.g. rescaling)
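
To make the two steps concrete, here is a worked sketch on a handful of made-up numbers. The series `values` and the planted outlier are assumptions; the bounds use the same 1.5 × IQR rule as the function below, and the rescaling is written out by hand rather than through a library.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; 95 is a planted outlier
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])

# Outlier bounds: 1.5 interquartile ranges below Q1 and above Q3
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
print(lower, upper, int(is_outlier.sum()))

# Min-max rescaling of the remaining values to the range [0, 1]
kept = values[~is_outlier]
scaled = (kept - kept.min()) / (kept.max() - kept.min())
```

Rescaling after outlier handling matters here: if 95 were left in, min-max scaling would squeeze every other value into a narrow band near zero.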

In [None]:
# Detecting outliers through Interquartile ranges

# Defining a function to return the range and count of outliers
# based on distance to interquartile ranges (< Q1 - 1.5*IQR, > Q3 + 1.5*IQR)

def bounds_number_of_outliers(x): 
    q1, q3 = np.percentile(x, [25, 75]) 
    iqr = q3 - q1 
    lower_bound = q1 - (iqr * 1.5) 
    upper_bound = q3 + (iqr * 1.5) 
    return lower_bound, upper_bound, len(np.where((x > upper_bound) | (x < lower_bound))[0])

In [None]:
[print(str(c)+" outlier lower bound, upper bound, and count:", bounds_number_of_outliers(data_prep_7[c].dropna())\
       ,sep="\n")\
 for c in cols_num]

**Notes on outliers:** <br>
Your notes here.

In [None]:
# To manage outliers, we can remove them, mark them, or transform them

# removing them (keep only rows inside the bounds; rows with NaN in this column are dropped too)
# lower_bound, upper_bound, _ = bounds_number_of_outliers(data_prep_7[''].dropna())
# data_prep_8 = data_prep_7[(data_prep_7[''] >= lower_bound) &\
# (data_prep_7[''] <= upper_bound)]

# Remove excluded features from columns
# cols_exclude_total = cols_exclude_total+cols_missing
# for l in [cols_num,cols_bool,cols_date,cols_cat,cols_other]:
#     l[:] = [c for c in l if c not in cols_exclude_total]

# marking them (1 = above the upper bound, 0 = otherwise)
# data_prep_8['_outlier'] = np.where(data_prep_7[''] > upper_bound, 1, 0)

# transforming them
# data_prep_8 = data_prep_7.copy()
# data_prep_8["log_of_diameter_breast_height"] = \
# np.log(data_prep_8["diameter_breast_height"])


# or something else
data_prep_8 = data_prep_7.copy()

In [None]:
# Reference (if needed) evaluating distributions of numeric values
# data_prep_7[''].dropna().hist(bins=100)

In [None]:
# Feature transformation (i.e. rescaling) 

# The standard approach is minmax scaling. Note that this does not handle null values.
data_prep_9 = data_prep_8.copy()

for c in cols_num:
    # minmax_scaling expects a DataFrame (or 2-D array) plus the list of columns to scale
    data_prep_9[c] = minmax_scaling(data_prep_9[[c]], columns=[c])[c]

# Chris Albon recommends defaulting to standardization unless the model 
# demands otherwise

# def scaler(x):
#     # Create the scaler, then fit it and return the standardized feature
#     # (x must be 2-D, e.g. data_prep_9[[c]])
#     scaler = StandardScaler()
#     return scaler.fit_transform(x)


#### References for Section 6
[Min Max Scaling on Kaggle](https://www.kaggle.com/rtatman/data-cleaning-challenge-scale-and-normalize-data) <br>
Outliers function partially credited to Albon, Chris. Machine Learning with Python Cookbook: Practical Solutions from Preprocessing to Deep Learning (p. 70). O'Reilly Media. Kindle Edition. <br>

## 7. Cleaning: Boolean Features
<a id='Section7'></a>
[Go back to contents](#Section0) <br>
Involves encoding Booleans.
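
One way to encode Booleans while guarding against unexpected values is sketched here. The series `toy` and the contents of `bool_map` are assumptions; `Series.map` is used instead of `replace` so that anything outside the mapping becomes NaN rather than silently passing through.

```python
import pandas as pd

# Hypothetical Boolean-like column with inconsistent case and whitespace
toy = pd.Series(["Yes", "no", " YES ", None, "No"], name="heritage_listed")

bool_map = {"yes": 1, "no": 0, "true": 1, "false": 0}

# Normalise case and whitespace, then map; unmapped or missing values become NaN
encoded = toy.str.strip().str.lower().map(bool_map)

# Surface any non-missing value that the mapping did not recognise
unmapped = toy[encoded.isnull() & toy.notnull()]
print(encoded.tolist(), unmapped.tolist())
```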

In [None]:
# Review bool cols values
[print(c, data_prep_9[c].value_counts(), sep="\n") for c in cols_bool] 

In [None]:
# create a replacement dict (placeholders - substitute your own column names and values)
replace_bool = {"col1":{"value1":1,"value2":0}}

In [None]:
# Replace values for ints to encode bool
data_prep_10 = data_prep_9.copy()
data_prep_10 = data_prep_10.replace(to_replace=replace_bool)

In [None]:
# Review transformed bool cols values
[print(c, data_prep_10[c].value_counts(), sep="\n") for c in cols_bool] 

## 8. Cleaning: Date Features
<a id='Section8'></a>
[Go back to contents](#Section0) <br>
This section involves
- Date Encoding
- Date Feature Generation
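
Both steps are sketched below on made-up data. The column name `date_planted` and the `%Y-%m-%d` format are assumptions; the ISO week is taken from `Series.dt.isocalendar()`, which needs pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical date column stored as text, with one unparseable value
toy = pd.DataFrame({"date_planted": ["2018-01-01", "2017-12-31", "not a date"]})

# An explicit format avoids slow per-value inference;
# errors="coerce" turns unparseable values into NaT instead of raising
toy["date_planted"] = pd.to_datetime(
    toy["date_planted"], format="%Y-%m-%d", errors="coerce"
)

# Break the date out into separate features
toy["year_planted"] = toy["date_planted"].dt.year
toy["month_planted"] = toy["date_planted"].dt.month
toy["week_planted"] = toy["date_planted"].dt.isocalendar().week  # ISO week number
print(toy)
```

Note that the ISO week belongs to the ISO year, so a date at the very start or end of a calendar year can fall in a week numbered for the neighbouring year.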

In [None]:
# Review date cols
cols_date

In [None]:
%%time
# Date encoding - note that this can be very slow, so it sometimes can be worthwhile specifying the datetime format
data_prep_11 = data_prep_10.copy()

for col in cols_date:
    data_prep_11[col] = pd.to_datetime(data_prep_11[col], infer_datetime_format = True)

In [None]:
# Date feature generation - for a tidy dataset, it can make sense to break out a date feature into week, month, and year. 
data_prep_11['year_'] = data_prep_11[''].dt.year
data_prep_11['month_'] = data_prep_11[''].dt.month
data_prep_11['week_'] = data_prep_11[''].dt.isocalendar().week # .dt.week was removed in pandas 2.0; isocalendar needs pandas >= 1.1

## 9. Cleaning: Categorical Features
<a id='Section9'></a>
[Go back to contents](#Section0) <br>
- Standardization
- Cardinality Restriction
- Encoding
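
Cardinality restriction is not shown in the cells of this section, so here is a minimal sketch of one common approach: folding rare levels into a single bucket before one hot encoding. The helper `restrict_cardinality`, the 20% threshold, and the `other` label are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical categorical column: one level is rare
toy = pd.DataFrame({"genus": ["ulmus"] * 5 + ["quercus"] * 4 + ["ficus"]})

def restrict_cardinality(series, min_share=0.2, other_label="other"):
    """Replace levels whose share of rows is below min_share with other_label."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # where() keeps values for which the condition is True and swaps the rest
    return series.where(~series.isin(rare), other_label)

toy["genus"] = restrict_cardinality(toy["genus"])

# One hot encode the restricted column
dummies = pd.get_dummies(toy, columns=["genus"], prefix_sep="-")
print(dummies.columns.tolist())
```

Fewer levels means fewer dummy columns, which keeps wide datasets manageable; the right threshold depends on the goal stated in Section 1.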

In [None]:
# Reviewing Categorical Features and Values
[print(col,data_prep_11[col].value_counts(),sep="\n") for col in cols_cat]

In [None]:
# Standardizing all text in categorical columns to protect against data entry errors
punct_reg = re.compile('[%s+]' % re.escape(string.whitespace + string.punctuation))
def text_proc(text):
    proc = str(text)
    proc = proc.lower() #changes case to lower
    proc = proc.strip() #removes leading and trailing spaces/tabs/new lines
    proc = punct_reg.sub('_', proc)
    return proc

In [None]:
%%time
data_prep_12=data_prep_11.copy()
for col in cols_cat:
    data_prep_12[col] = data_prep_12[col].apply(lambda x: text_proc(x))

In [None]:
# Evaluating after transformation
[print(col,data_prep_12[col].value_counts(),sep="\n") for col in cols_cat]

In [None]:
# Encoding categorical features
# Ordinal categories can be handled through replace
scale_mapper = {'new':1,
                'juvenile':2,
                'semi_mature':3,
                'mature':4,
                'over_mature':5
               }
data_prep_12[''] = data_prep_12[''].replace(scale_mapper) # pass the dict itself, not its name as a string

In [None]:
# Nominal categories can be handled through one hot encoding or dummification
data_prep_13 = pd.get_dummies(data_prep_12, prefix = None, prefix_sep = '-', dummy_na = False, columns = [''])

In [None]:
data_prep_13.columns.tolist()

## 10. Cleaning: Other Features
<a id='Section10'></a>
[Go back to contents](#Section0) <br>
Section for ad hoc or specialised cleaning.

In [None]:
# E.g. for Geo features - obtaining postcode for each tree
# Though there is a limit of 1 search per second!
import sys
!{sys.executable} -m pip install geopy

In [None]:
from geopy.geocoders import Nominatim
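geolocator = Nominatim(user_agent="structured_data_prep_notebook") # recent geopy versions require a user_agent; this name is a placeholder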

In [None]:
result = geolocator.reverse("-37.794463412577585, 144.93192049089112")

In [None]:
result.raw['address']['postcode']

In [None]:
result.raw['address'].keys()

## 11. Evaluation of Prepared Data
<a id='Section11'></a>
[Go back to contents](#Section0) <br>
This section involves
- Profiling the cleaned data
- Managing any remaining warnings
- Saving the dataset in the HDF5 format
- Noting things to change based on the feedback from modelling (e.g. the MSE of estimators)
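
Alongside the profile report, a few programmatic checks can catch leftovers quickly. The following is a sketch only; `prep_issues` is a hypothetical helper, and the three checks it runs (nulls, non-numeric columns, duplicate index values) are assumptions about what usually blocks model fitting. Date columns would be reported as non-numeric.

```python
import numpy as np
import pandas as pd

def prep_issues(df):
    """List remaining problems that would typically block model fitting."""
    issues = []
    null_cols = df.columns[df.isnull().any()].tolist()
    if null_cols:
        issues.append(f"columns with nulls: {null_cols}")
    non_numeric = df.select_dtypes(exclude=[np.number, "bool"]).columns.tolist()
    if non_numeric:
        issues.append(f"non-numeric columns: {non_numeric}")
    if not df.index.is_unique:
        issues.append("index is not unique")
    return issues

# Hypothetical toy data with one null and one text column
toy = pd.DataFrame({"height": [1.0, np.nan], "genus": ["ulmus", "ficus"]})
issues = prep_issues(toy)
print(issues)
```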

In [None]:
# Profile cleaned data
report2 = pandas_profiling.ProfileReport(data_prep_13)
report2

In [None]:
# Resolve warnings - this may involve imputation (some example code below)
from sklearn.impute import SimpleImputer # replaces Imputer, which was removed in scikit-learn 0.22

data_prep_14 = data_prep_13.copy()
cols_impute = ['diameter_breast_height_scaled','useful_life_expectency_value_scaled','age_description']
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean') # imputes column by column
data_prep_14_imputed = pd.DataFrame(data = mean_imputer.fit_transform(data_prep_14[cols_impute])\
            ,index=data_prep_14.index \
            ,columns=cols_impute)
data_prep_15 = \
pd.concat([data_prep_14_imputed,data_prep_14.drop(cols_impute,axis=1)],axis=1)

In [None]:
# Check all warnings are resolved
report3 = pandas_profiling.ProfileReport(data_prep_15)
report3

In [None]:
%%time
# Save data to HDF5 format to preserve formatting
# (%%time must be the first line of the cell; the compression level argument is complevel)
data_prep_15.to_hdf('data_prep_15.h5', key = 'data_prep_15', mode = 'w', append = True, format = 'table',\
          index = False, complib = 'blosc', complevel = 9, data_columns = list(data_prep_15.index.names))

## 12. Construction of a Prep Pipeline
<a id='Section12'></a>
[Go back to contents](#Section0) <br>
Once data transformations are finalized, it makes sense to group them into a pipeline, to allow for reproducible file handling <br>
Functions are a great way to group this. <br>
Below is an example block from another project. The next version of this workbook will build this out. <br>
This workbook didn't touch much on dummification of categorical variables - so some code will be available for that as well. <br>
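
As a rough idea of what grouping steps into functions can look like, here is a minimal sketch that chains small steps with `DataFrame.pipe`. All function names, column names, and the Boolean mapping are hypothetical and chosen for this example; it is one possible shape for a pipeline, not the one the next version of this workbook will use.

```python
import pandas as pd

def standardise_columns(df):
    """Return a copy with lower-case, underscore-separated column names."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip().str.lower().str.replace(r"\W+", "_", regex=True)
    )
    return out

def drop_duplicate_rows(df, subset):
    """Keep the first row for each value of the candidate index column."""
    return df.drop_duplicates(subset=subset, keep="first")

def encode_booleans(df, cols, mapping):
    """Map text Booleans to integers; values outside the mapping become NaN."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].str.lower().map(mapping)
    return out

def prep_pipeline(df):
    """Chain the steps so the same prep can be rerun on any new extract."""
    return (
        df.pipe(standardise_columns)
          .pipe(drop_duplicate_rows, subset="tree_id")
          .pipe(encode_booleans, cols=["is_heritage"], mapping={"yes": 1, "no": 0})
    )

# Hypothetical raw extract
raw = pd.DataFrame({"Tree ID": [1, 1, 2], "Is Heritage": ["Yes", "Yes", "no"]})
prepared = prep_pipeline(raw)
print(prepared)
```

Each step takes a frame and returns a new one, so the raw data stays untouched and individual steps can be tested or reordered on their own.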

In [None]:
# Example functions - note how they're abstracted to handle multiple dataframes.
punct_reg = re.compile('[%s+]' % re.escape(string.whitespace + string.punctuation))

def text_proc(text):
    proc = str(text)
    proc = punct_reg.sub('_', proc)
    return proc

def data_processing(df):
    _data = df.copy()

    _num_cols = [col for col in _data.columns[(_data.dtypes == np.float64) | (_data.dtypes == np.float32)]]
    
    _date_cols = list(_data.columns[_data.columns.str.contains('date')])
    
    _text_cols = [col for col in _data.columns[_data.dtypes == 'object'] if col not in _date_cols]
    
    
    _bool_cols = [col for col in _text_cols if ('yes' in _data[col].values) or ('no' in _data[col].values) or
                  ('true' in _data[col].values) or ('false' in _data[col].values) ]
    
    for col in _num_cols:
        # assign the result back; in-place fillna on a selected column may not update _data
        _data[col] = _data[col].fillna(0)
    
    for col in _date_cols:
        
        _data[col] = pd.to_datetime(_data[col], infer_datetime_format = True)
    
    for col in _text_cols:
        _data[col] = _data[col].str.lower().apply(lambda x: text_proc(x))
    
    for col in _bool_cols:
        _data[col] = _data[col].replace({ 'nan':0, 'no':0, 'yes':1, 'false':0, 'true':1})
        _data[col] = _data[col].astype(np.float32)
    
    return _data

def data_denoising(df):
    _data = df.copy()
    
    _date_cols = list(_data.columns[_data.columns.str.contains('date')])

    _text_cols = [col for col in _data.columns[_data.dtypes == 'object'] if col not in _date_cols]

    _bool_cols = [col for col in _text_cols if ('yes' in _data[col].values) or ('no' in _data[col].values) or
                  ('true' in _data[col].values) or ('false' in _data[col].values) ]
    
    feats_distros = dict()
    for c in _text_cols:
        _feat = _data[c]
        _feat = _feat.value_counts()
        _feat.fillna(0, inplace=True)
        _feat = _feat.astype(np.int32)
        _feat.sort_values(ascending = False, inplace = True)
        _feat = pd.DataFrame(_feat)
        _feat.columns = [c + ' count']
        _feat[c + ' distribution'] = 100*_feat/_feat.sum()
        feats_distros.update({c:_feat})

    for feat in _text_cols:
        feat_distro = feats_distros[feat][feat + ' distribution']
        feat_index = feat_distro[feat_distro < 1].index.tolist()
        len_feat_index = len(feat_index)
        if len_feat_index > 0:
            feat_sub = len_feat_index*[np.nan]
            feat_dict = dict(zip(feat_index, feat_sub))
            _data[feat] = _data[feat].replace(feat_dict)

    return _data

In [None]:
# Example function call
fy17fy18_event_2 = data_denoising(data_processing(fy17fy18_event))

In [None]:
%%time
# Example Dummification (for analysis of text categories)
# (%%time must be the first line of the cell)
merged_dummies = pd.get_dummies(merged, prefix = None, prefix_sep = '-', dummy_na = False, columns = merged_text_cols)

## 13. References
<a id='Section13'></a>
[Go back to contents](#Section0) <br>
- [Chris Albon’s blog](https://chrisalbon.com/) and [recently published book](https://www.amazon.com/Machine-Learning-Python-Cookbook-Preprocessing/dp/1491989386/ref=sr_1_1)
- [Theodore Petrou’s Pandas Cookbook](https://www.packtpub.com/big-data-and-business-intelligence/pandas-cookbook)
- Rachael Tatman’s [Data Cleaning Challenge](https://www.kaggle.com/rtatman/data-cleaning-challenge-handling-missing-values) and [blog](http://www.rctatman.com/)
- Katharine Jarmul’s [Automating Data Cleanup with Python](https://www.youtube.com/watch?v=gp-ngPV_ZX8)
- DataCamp’s [Importing and Cleaning data with Python](https://www.datacamp.com/tracks/importing-cleaning-data-with-python)
- [Luis Daniel Lopez-Sanchez](https://www.linkedin.com/in/luis-daniel-l%C3%B3pez-s%C3%A1nchez-5b624864/)