## Introduction
This notebook is a workshop on data preparation.



#### Agenda
*  Loading Libraries
*  Loading Data
*  Getting Basic Idea About Data
*  Summary Statistics
*  Missing Values and Dealing with Missing Values
*  One Hot Encoding (Creating dummies for categorical columns)
*  Standardization / Normalization
*  Splitting the dataset into train and test data
*  Dealing with Imbalanced Data
  



## Loading Libraries
Not all Python capabilities are loaded into the working environment by default, even if they are already installed on your system. So we import each library that we want to use.

In data science, NumPy and pandas are among the most commonly used libraries. NumPy is used for numerical calculations such as means, medians, and square roots. Pandas is used for data processing and data frames. We give the libraries alias names for convenience (numpy --> np and pandas --> pd).

In [None]:
import pandas as pd                  # Data analysis and data manipulation tool
import numpy as np                   # A fundamental package for linear algebra and multidimensional arrays
import random                        # Library to generate random numbers
from collections import Counter      # collections is a Python module that implements specialized container datatypes providing
                                     # alternatives to Python's general-purpose built-in containers: dict, list, set, and tuple.
                                     # Counter is a dict subclass for counting hashable objects
# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# To ignore warnings in the notebook
import warnings
warnings.filterwarnings("ignore")
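
With the libraries imported, a quick check of the kind of work each one does can look like the sketch below. The array `values` and the frame `toy_df` are made-up illustrations, not part of the workshop dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only to illustrate the calls
values = np.array([4.0, 9.0, 16.0, 25.0])

mean_value = np.mean(values)      # arithmetic mean
median_value = np.median(values)  # middle value of the sorted data
roots = np.sqrt(values)           # element-wise square root

# A pandas DataFrame stores the same numbers as labelled columns
toy_df = pd.DataFrame({"value": values, "root": roots})
print(mean_value, median_value)
print(toy_df)
```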

In [None]:
# Install Pandas-Profiling
!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

## Loading Data
The pandas module is used for reading files. Our data is in '.csv' format, so we will use the 'read_csv()' function to load it.

**Disclaimer:** Loading fraud data will take time.

In [None]:
# Upload dataset from local
from google.colab import files
uploaded = files.upload()

In [None]:
# Loading Data
df = pd.read_csv("Mall_Customers.csv")


### Getting Basic Idea About Data

In [None]:
df.head()

In [None]:
df.info()      # Returns a concise summary of dataset

There are ....columns with ... observations.

In [None]:
# Taking a look at the target variable
df.value_counts("")       # The value_counts() function is used to get a Series containing counts of unique values.

In [None]:
# We can also use countplot from seaborn to plot the above information graphically.
sns.countplot(x=df[""])

### Summary Statistics

In [None]:
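from pandas_profiling import ProfileReport   # Installed above; newer releases are published as ydata_profiling
profile = ProfileReport(df, title = "Mall Customers Profiling Report")
profile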

### Missing values
Real-world datasets often have some missing values. They may arise during data collection or because of a data validation rule.


In [None]:
 # To get percentage of missing data in each column
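
One possible way to fill in this cell is sketched below. It runs on a small made-up frame (`toy_df`, with hypothetical column names and values) so that it is self-contained; in the workshop the same expression would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df
toy_df = pd.DataFrame({
    "Age": [19, np.nan, 35, 40],
    "Gender": ["Male", "Female", None, "Female"],
    "Income": [15, 16, 17, 18],
})

# isnull() marks missing cells as True; the column mean of those booleans
# is the fraction missing, so multiplying by 100 gives a percentage
missing_pct = toy_df.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))

# Number of columns that contain at least one missing value
print((missing_pct > 0).sum())
```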



Out of .... columns, ... have some missing values.

### Dealing with Missing Values
*  Filling the missing values with the right technique can change our results drastically.
*  Also, there is no fixed rule for filling missing values.




In [None]:
# Dealing with Missing Values
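
Since there is no fixed rule, the sketch below shows only one common choice, and it is an assumption rather than the workshop's prescribed method: fill numeric columns with the median and non-numeric columns with the most frequent value. The frame `toy_df` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df
toy_df = pd.DataFrame({
    "Age": [19.0, np.nan, 35.0, 40.0],
    "Gender": ["Male", "Female", None, "Female"],
})

filled = toy_df.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Numeric column: the median is less sensitive to outliers than the mean
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        # Non-numeric column: use the most frequent value (the mode)
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```

Other options, such as dropping rows with `dropna()` or filling with a constant, may suit a given column better. The right choice depends on why the values are missing.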


In [None]:
# Let's have a look if there still exist any missing values
df.isnull().sum()

Notice that no column has missing values now.

### One Hot Encoding (Creating dummies for categorical columns)
In this strategy, each category value is converted into a new column, and each row is assigned a 1 or 0 (notation for true/false) in that column. In Python there is a class 'OneHotEncoder' in 'sklearn.preprocessing' for this task, but here we will use the pandas function 'get_dummies()'. It produces the same kind of indicator columns as 'OneHotEncoder' from sklearn.preprocessing.
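
A minimal sketch of `get_dummies()` on a made-up frame is shown below. The column names and values in `toy_df` are hypothetical; in the workshop the call would be applied to the categorical columns of `df`.

```python
import pandas as pd

# Hypothetical toy data with one categorical column
toy_df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Age": [19, 21, 20],
})

# Each distinct value of "Gender" becomes its own 0/1 column;
# dtype=int asks for integers instead of booleans
encoded = pd.get_dummies(toy_df, columns=["Gender"], dtype=int)
print(encoded)
```

Passing `drop_first=True` would drop one indicator column per encoded variable, which some models need to avoid redundant columns.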


In [None]:
df.head()

### Standardization / Normalization



In [None]:
from sklearn.preprocessing import StandardScaler

# X is assumed to be a DataFrame holding the numeric feature columns
scaled_features = StandardScaler().fit_transform(X)
scaled_features = pd.DataFrame(data=scaled_features, columns=X.columns, index=X.index)

In [None]:
# Let's see how the data looks after scaling
scaled_features.head()
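
To make the difference between the two terms in the heading concrete, the sketch below compares standardization (`StandardScaler`: zero mean and unit variance per column) with min-max normalization (`MinMaxScaler`: each column rescaled to the range 0 to 1). The frame `toy_X` is hypothetical; the manual formula is included only as a cross-check.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical numeric features standing in for X
toy_X = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Income": [15.0, 60.0, 120.0]})

# Standardization: (x - mean) / std, computed per column
standardized = pd.DataFrame(StandardScaler().fit_transform(toy_X), columns=toy_X.columns)

# Normalization: (x - min) / (max - min), computed per column
normalized = pd.DataFrame(MinMaxScaler().fit_transform(toy_X), columns=toy_X.columns)

# Manual cross-check; StandardScaler uses the population std (ddof=0)
manual = (toy_X - toy_X.mean()) / toy_X.std(ddof=0)

print(standardized)
print(normalized)
print(np.allclose(standardized.values, manual.values))
```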

### Splitting the dataset into train and test data

We will keep 30% of the data for the test set.

In [None]:
# Splitting the data
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 42)



# test_size = 0.3: 30% of the data goes to the test set and 70% goes to the train set
# random_state = 42: fixes the shuffle, so the split is the same each time you run the code
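
The effect of the two parameters can be checked on a small made-up array, as in the sketch below. The names `toy_X` and `toy_y` are hypothetical stand-ins for `X` and `Y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows with 2 features each, and one label per row
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

# Run the same split twice with the same random_state
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.3, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(toy_X, toy_y, test_size=0.3, random_state=42)

print(len(X_tr), len(X_te))          # sizes of the train and test sets
print(np.array_equal(y_te, y_te2))   # True when both runs pick the same test rows
```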

### Verify the Imbalanced Data

In [None]:
#Verify the Imbalanced Data
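
One way to fill in this cell is to count the labels with `Counter`, which was imported at the top of the notebook. The list `toy_labels` below is a hypothetical stand-in for the training labels (for example `Y_train`).

```python
from collections import Counter

# Hypothetical labels standing in for the training target
toy_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Count how many times each class occurs
counts = Counter(toy_labels)

# Convert the counts to shares of the total to see the imbalance at a glance
total = sum(counts.values())
shares = {label: count / total for label, count in counts.items()}

print(counts)
print(shares)
```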

### Dealing with Imbalanced Data






In [None]:
# Dealing with Imbalanced Data
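
The notebook does not say which technique it intends here, so the sketch below is only an assumption: random oversampling of the minority class using pandas alone. The frame `toy_train` and its column names are hypothetical. Resampling like this is normally applied to the training data only, so that the test set keeps its original class balance.

```python
import pandas as pd

# Hypothetical training data: 8 rows of class 0 and 2 rows of class 1
toy_train = pd.DataFrame({
    "feature": range(10),
    "target": [0] * 8 + [1] * 2,
})

# Size of the largest class; every other class is grown to this size
majority_size = int(toy_train["target"].value_counts().max())

parts = []
for label, group in toy_train.groupby("target"):
    n_extra = majority_size - len(group)
    if n_extra > 0:
        # Draw extra rows with replacement from the smaller class
        extra = group.sample(n=n_extra, replace=True, random_state=42)
        group = pd.concat([group, extra])
    parts.append(group)

# Combine the classes and shuffle the rows
balanced = pd.concat(parts).sample(frac=1, random_state=42).reset_index(drop=True)
print(balanced["target"].value_counts())
```

Random undersampling of the majority class is the mirror-image option. It keeps every remaining row original but discards data.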
