# Exercise - clean and balance your data
The first task at hand, before starting this project, is to clean and balance your data to get better results.

# Delicious Asian and Indian Cuisines 


Install imblearn, which pulls in imbalanced-learn and gives you access to SMOTE. imbalanced-learn is a scikit-learn-compatible package that helps handle imbalanced data when performing classification. (https://imbalanced-learn.org/stable/)

In [1]:
pip install imblearn

Collecting imblearn
  Downloading imblearn-0.0-py2.py3-none-any.whl.metadata (355 bytes)
Collecting imbalanced-learn (from imblearn)
  Downloading imbalanced_learn-0.12.2-py3-none-any.whl.metadata (8.2 kB)
Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Downloading imbalanced_learn-0.12.2-py3-none-any.whl (257 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m258.0/258.0 kB[0m [31m7.0 MB/s[0m eta [36m0:00:00[0m:00:01[0m
[?25hInstalling collected packages: imbalanced-learn, imblearn
Successfully installed imbalanced-learn-0.12.2 imblearn-0.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import the packages you need to import your data and visualize it, also import SMOTE from imblearn

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
from imblearn.over_sampling import SMOTE

In [None]:
df = pd.read_csv('./cuisines.csv')

This dataset has 385 columns, most of which indicate the ingredients used in recipes from a given set of cuisines.

In [None]:
# Preview the first few rows of the data

df.head()

In [None]:
# Get info about this data

df.info()

In [None]:
# Count the rows for each cuisine

df.cuisine.value_counts()

# Learning about cuisines

Show the number of rows per cuisine in a bar graph.

In [None]:
# Plot the data as bars by calling barh()

df.cuisine.value_counts().plot.barh()

In [None]:
# There are a finite number of cuisines, but the distribution of data is uneven.
# We can fix that! Before doing so, explore a little more.

# Find out how much data is available per cuisine and print it out:

thai_df = df[(df.cuisine == "thai")]
japanese_df = df[(df.cuisine == "japanese")]
chinese_df = df[(df.cuisine == "chinese")]
indian_df = df[(df.cuisine == "indian")]
korean_df = df[(df.cuisine == "korean")]

print(f'thai df: {thai_df.shape}')
print(f'japanese df: {japanese_df.shape}')
print(f'chinese df: {chinese_df.shape}')
print(f'indian df: {indian_df.shape}')
print(f'korean df: {korean_df.shape}')
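
The same per-cuisine row counts can also be obtained in one step, without building a filtered dataframe per cuisine. A minimal sketch, assuming a small made-up dataframe `toy_df` stands in for `df`:

```python
import pandas as pd

# Hypothetical toy data standing in for the cuisines dataframe
toy_df = pd.DataFrame({
    "cuisine": ["thai", "thai", "indian", "korean", "indian", "indian"],
    "rice":    [1, 0, 1, 1, 0, 1],
})

# Number of rows (recipes) per cuisine, largest first
rows_per_cuisine = toy_df.groupby("cuisine").size().sort_values(ascending=False)
print(rows_per_cuisine)
```

The separate per-cuisine dataframes are still useful later on, since each one is passed to the ingredient-counting function.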

## What are the top ingredients by class

We can dig deeper into the data and learn which ingredients are typical of each cuisine.
We should also clean out recurrent data that creates confusion between cuisines.

In [None]:
# Create a function create_ingredient_df() in Python to create an ingredient dataframe
# This function starts by dropping the unhelpful columns, then sorts the ingredients by their count

def create_ingredient_df(df):
    # drop the cuisine and unnamed columns, then sum each ingredient column and label the totals 'value'
    ingredient_df = df.drop(columns=['cuisine','Unnamed: 0']).sum().to_frame('value')
    # drop ingredients that have a 0 sum
    ingredient_df = ingredient_df[ingredient_df['value'] != 0]
    # sort by count, most used first
    ingredient_df = ingredient_df.sort_values(by='value', ascending=False)
    return ingredient_df
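
To see what the function returns, here is a minimal sketch on a made-up three-recipe dataframe. The `toy_df` data is hypothetical, and the function body is a condensed restatement of the one defined above so that the example runs on its own:

```python
import pandas as pd

def create_ingredient_df(df):
    # Drop non-ingredient columns, total each ingredient,
    # discard ingredients that are never used, sort by count
    totals = df.drop(columns=["cuisine", "Unnamed: 0"]).sum().to_frame("value")
    totals = totals[totals["value"] != 0]
    return totals.sort_values(by="value", ascending=False)

# Hypothetical toy data: three Thai recipes, three ingredient columns
toy_df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "cuisine": ["thai", "thai", "thai"],
    "basil": [1, 0, 1],
    "coconut": [1, 1, 1],
    "cumin": [0, 0, 0],
})

toy_ingredient_df = create_ingredient_df(toy_df)
print(toy_ingredient_df)
```

In this toy case `cumin` disappears because no recipe uses it, and `coconut` is listed before `basil` because it is used more often.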


In [None]:
# We can use the create_ingredient_df() function to get an idea of the ten most popular ingredients by cuisine
# Call create_ingredient_df() and plot the result by calling barh()

thai_ingredient_df = create_ingredient_df(thai_df)
thai_ingredient_df.head(10).plot.barh()

In [None]:
japanese_ingredient_df = create_ingredient_df(japanese_df)
japanese_ingredient_df.head(10).plot.barh()

In [None]:
chinese_ingredient_df = create_ingredient_df(chinese_df)
chinese_ingredient_df.head(10).plot.barh()

In [None]:
indian_ingredient_df = create_ingredient_df(indian_df)
indian_ingredient_df.head(10).plot.barh()

In [None]:
korean_ingredient_df = create_ingredient_df(korean_df)
korean_ingredient_df.head(10).plot.barh()

Drop very common ingredients (those common to all cuisines), since they can create confusion between distinct cuisines.

In [None]:
# Everyone loves rice, garlic and ginger!

feature_df = df.drop(['cuisine','Unnamed: 0','rice','garlic','ginger'], axis=1)
labels_df = df.cuisine
feature_df.head()
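
Rather than reading the shared ingredients off the five charts, one option is to intersect the top-N ingredient lists of all cuisines. This is a minimal sketch on made-up data, not the method used in this notebook; `toy_df` and the `top_n` cutoff are assumptions chosen for illustration:

```python
import pandas as pd

# Hypothetical toy data: three recipes for each of three cuisines
toy_df = pd.DataFrame({
    "cuisine": ["thai"] * 3 + ["indian"] * 3 + ["korean"] * 3,
    "rice":    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    "garlic":  [1, 1, 0, 1, 1, 0, 1, 1, 0],
    "basil":   [1, 0, 0, 0, 0, 0, 0, 0, 0],
    "cumin":   [0, 0, 0, 1, 0, 0, 0, 0, 0],
    "kimchi":  [0, 0, 0, 0, 0, 0, 1, 0, 0],
})

top_n = 3  # hypothetical cutoff

# Total use of every ingredient within each cuisine
totals = toy_df.groupby("cuisine").sum()

# Top-N ingredient names per cuisine, then keep those shared by all cuisines
top_sets = [set(row.nlargest(top_n).index) for _, row in totals.iterrows()]
common = set.intersection(*top_sets)
print(sorted(common))
```

On the toy data only `garlic` and `rice` appear in every cuisine's top three, so they would be the candidates to drop.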

Balance the data with SMOTE (Synthetic Minority Over-sampling Technique), which oversamples every smaller class up to the size of the largest class. Read more here: https://imbalanced-learn.org/dev/references/generated/imblearn.over_sampling.SMOTE.html

In [None]:
# Call fit_resample(); SMOTE generates new samples by interpolation

oversample = SMOTE()
transformed_feature_df, transformed_label_df = oversample.fit_resample(feature_df, labels_df)
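
To make "interpolation" concrete: for a minority-class sample, SMOTE picks one of its nearest minority-class neighbours and creates a new point somewhere on the line segment between the two. The sketch below illustrates only that single step with NumPy. It is a simplified illustration, not imbalanced-learn's implementation, and `interpolate_sample` is a hypothetical helper:

```python
import numpy as np

def interpolate_sample(x, neighbour, rng):
    """Return a synthetic point on the segment between x and neighbour."""
    gap = rng.uniform(0.0, 1.0)  # random position along the segment
    return x + gap * (neighbour - x)

rng = np.random.default_rng(seed=0)
x = np.array([1.0, 0.0, 1.0])          # a minority-class sample
neighbour = np.array([1.0, 1.0, 0.0])  # one of its nearest minority neighbours

synthetic = interpolate_sample(x, neighbour, rng)
print(synthetic)
```

Features on which the two samples agree (the first one here) are copied unchanged; the others land somewhere between the two original values.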

In [None]:
# By balancing your data, you'll have better results when classifying it. Think about a binary classification.
# If most of your data is one class, an ML model is going to predict that class more frequently, just because there is more data for it.
# Balancing the data removes this skew so that no class dominates the training set.

# We can check the number of rows per cuisine label before and after balancing

print(f'new label count: {transformed_label_df.value_counts()}')
print(f'old label count: {df.cuisine.value_counts()}')
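
The binary-classification argument above can be checked with a few lines of plain Python. This is a minimal sketch with made-up labels (95 samples of a hypothetical class "a" and 5 of class "b") and a "model" that always predicts the majority class:

```python
# Hypothetical binary labels: 95 samples of class "a", 5 of class "b"
labels = ["a"] * 95 + ["b"] * 5

# A "model" that ignores its input and always predicts the majority class
predictions = ["a"] * len(labels)

# Fraction of all samples predicted correctly
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Fraction of the minority class "b" predicted correctly
minority_recall = sum(
    p == y for p, y in zip(predictions, labels) if y == "b"
) / labels.count("b")

print(f"accuracy: {accuracy:.2f}")                # accuracy: 0.95
print(f"minority recall: {minority_recall:.2f}")  # minority recall: 0.00
```

The accuracy looks excellent even though the minority class is never identified, which is why imbalanced training data can hide a poor classifier.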

In [None]:
# The data is nice and clean as well as balanced.
# The next step is to save the balanced data, including labels and features, into a new dataframe that can be exported into a file

transformed_df = pd.concat([transformed_label_df,transformed_feature_df],axis=1, join='outer')
transformed_df

In [None]:
# We can take one more look at the data using transformed_df.info() and transformed_df.head()
# (head() goes last because a notebook cell only displays the value of its final expression)

transformed_df.info()
transformed_df.head()

Save the file for future use

In [None]:
transformed_df.to_csv("../../data/cleaned_cuisines.csv")
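
Note that `to_csv` writes the row index as an extra unnamed column by default, which is a likely origin of the `Unnamed: 0` column dropped earlier. A minimal sketch of the difference, using an in-memory buffer instead of a real file (the `toy_df` data is made up):

```python
import io
import pandas as pd

# Hypothetical toy data standing in for transformed_df
toy_df = pd.DataFrame({"cuisine": ["thai", "indian"], "basil": [1, 0]})

# Default: the index is written too, and is read back as 'Unnamed: 0'
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False: only the real columns are written
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```

Passing `index=False` when saving is one way to avoid having to drop that column again when the file is reloaded.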