 # Data Cleaning and Transformation
 
For this notebook, I will be using the following libraries:

- pandas for data manipulation.
- numpy for mathematical operations.
- seaborn and matplotlib for data visualization.
- sklearn for machine learning.
- scipy for statistical operations.

## 1- Import the required libraries

In [1]:
# Data manipulation and mathematical operations
import pandas as pd
import numpy as np 

# Statistical computations
from scipy.stats import norm
from scipy import stats



In [5]:
# Data visualization 

import seaborn as sns 
import matplotlib.pyplot as plt
%matplotlib inline

In [6]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

This notebook walks through the steps that a data scientist should carry out before applying any machine learning algorithm. It is divided into two main parts:

- The first part presents the data cleaning steps.
- The second part illustrates the data transformation steps.

## 2- Data reading 

In [4]:
df=pd.read_csv('./data/Cleaned_DS_Jobs.csv')
df.head(2)

Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Type of ownership,Industry,...,company_age,python,excel,hadoop,spark,aws,tableau,big_data,job_simp,seniority
0,Sr Data Scientist,137-171,Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst,"New York, NY","New York, NY",1001 to 5000 employees,Nonprofit Organization,Insurance Carriers,...,27,0,0,0,0,1,0,0,data scientist,senior
1,Data Scientist,137-171,"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,Company - Public,Research & Development,...,52,0,0,1,0,0,0,1,data scientist,na


## 3- Data cleaning
### 3.1- Preliminary exploration

####  a. More information 

`df.info()` lists every feature with its non-null count and data type:

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660 entries, 0 to 659
Data columns (total 27 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Job Title          660 non-null    object 
 1   Salary Estimate    660 non-null    object 
 2   Job Description    660 non-null    object 
 3   Rating             660 non-null    float64
 4   Company Name       660 non-null    object 
 5   Location           660 non-null    object 
 6   Headquarters       660 non-null    object 
 7   Size               660 non-null    object 
 8   Type of ownership  660 non-null    object 
 9   Industry           660 non-null    object 
 10  Sector             660 non-null    object 
 11  Revenue            660 non-null    object 
 12  min_salary         660 non-null    int64  
 13  max_salary         660 non-null    int64  
 14  avg_salary         660 non-null    int64  
 15  job_state          660 non-null    object 
 16  same_state         660 non

#### b. General Overview

Here is a statistical description of the numerical features:

In [9]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Rating,660.0,3.587424,1.18354,0.0,3.3,3.8,4.3,5.0
min_salary,660.0,99.29697,33.161485,31.0,79.0,91.0,122.0,212.0
max_salary,660.0,148.301515,48.264588,56.0,119.0,133.0,165.0,331.0
avg_salary,660.0,123.612121,39.786698,43.0,103.0,114.0,136.0,271.0
same_state,660.0,0.407576,0.491756,0.0,0.0,0.0,1.0,1.0
company_age,660.0,29.736364,39.763033,-1.0,5.0,16.0,37.25,239.0
python,660.0,0.730303,0.444139,0.0,0.0,1.0,1.0,1.0
excel,660.0,0.440909,0.496873,0.0,0.0,0.0,1.0,1.0
hadoop,660.0,0.212121,0.40912,0.0,0.0,0.0,0.0,1.0
spark,660.0,0.281818,0.450226,0.0,0.0,0.0,1.0,1.0


Passing `include='all'` extends the summary to all features, both numerical and categorical:

In [14]:
df.describe(include='all').transpose()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Job Title,660.0,168.0,Data Scientist,333.0,,,,,,,
Salary Estimate,660.0,30.0,75-131,31.0,,,,,,,
Job Description,660.0,481.0,Job Overview: The Data Scientist is a key memb...,12.0,,,,,,,
Rating,660.0,,,,3.587424,1.18354,0.0,3.3,3.8,4.3,5.0
Company Name,660.0,425.0,Hatch Data Inc,12.0,,,,,,,
Location,660.0,202.0,"San Francisco, CA",69.0,,,,,,,
Headquarters,660.0,227.0,"New York, NY",33.0,,,,,,,
Size,660.0,9.0,51 to 200 employees,128.0,,,,,,,
Type of ownership,660.0,13.0,Company - Private,386.0,,,,,,,
Industry,660.0,58.0,-1,71.0,,,,,,,


#### c. Values Count

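To see how often each value of a feature occurs, use `value_counts()`. In the next example I have selected 'avg_salary'. In the output, the index holds the salary value and the column holds the number of rows with that value.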

In [18]:
avg_salary_count=df['avg_salary'].value_counts().to_frame()
avg_salary_count

Unnamed: 0,avg_salary
107,43
92,42
136,41
114,40
106,38
103,31
115,31
105,31
154,29
99,28


#### d. Data frame shape 
My data frame contains 660 rows and 27 columns:

In [15]:
df.shape

(660, 27)

#### e.  Columns names 
Listing the column names of my data frame. There are 27 columns, as mentioned before.

In [16]:
df.columns

Index(['Job Title', 'Salary Estimate', 'Job Description', 'Rating',
       'Company Name', 'Location', 'Headquarters', 'Size', 'Type of ownership',
       'Industry', 'Sector', 'Revenue', 'min_salary', 'max_salary',
       'avg_salary', 'job_state', 'same_state', 'company_age', 'python',
       'excel', 'hadoop', 'spark', 'aws', 'tableau', 'big_data', 'job_simp',
       'seniority'],
      dtype='object')

#### f. Correlation and Correlation Matrix
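
This step is not filled in yet, so the following is only a minimal sketch of how a correlation matrix could be computed with pandas. The `toy` data frame and its values are invented for illustration and are not rows from `Cleaned_DS_Jobs.csv`; only the column names follow the data frame above.

```python
import pandas as pd

# Toy rows for illustration only (hypothetical values, not taken from the CSV).
toy = pd.DataFrame({
    "Job Title": ["A", "B", "C", "D", "E"],
    "min_salary": [80, 90, 100, 110, 120],
    "max_salary": [120, 135, 150, 165, 180],
    "Rating": [3.1, 4.2, 3.8, 2.9, 4.5],
})

# Keep only the numeric columns, then compute pairwise Pearson correlations.
corr_matrix = toy.select_dtypes(include="number").corr()
print(corr_matrix.round(2))
```

On the real data frame, the same call on `df` would probably show strong correlations among `min_salary`, `max_salary` and `avg_salary`, since the average is derived from the other two. The matrix can be passed to `sns.heatmap(corr_matrix, annot=True)` to display it as a heatmap.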

#### g. Data Distribution
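
This step is also empty in the notebook. One possible way to summarise the shape of a distribution numerically is through its skewness and excess kurtosis, sketched below on a hypothetical `toy_salary` series (invented values, not taken from the CSV).

```python
import pandas as pd

# Hypothetical salaries with one large value on the right (illustration only).
toy_salary = pd.Series([90, 95, 100, 105, 110, 115, 120, 270], name="avg_salary")

skewness = toy_salary.skew()   # > 0 means a longer right tail
kurt = toy_salary.kurt()       # excess kurtosis (0 for a normal distribution)
print(f"skewness: {skewness:.2f}, excess kurtosis: {kurt:.2f}")
```

For a visual check, recent seaborn versions provide `sns.histplot(df['avg_salary'], kde=True)`, which draws a histogram with a density estimate.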

### 3.2- Cleaning and preprocessing

#### a. Handling the Duplicates
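
The summary above reports only 481 unique values of 'Job Description' across 660 rows, so repeated postings are plausible. A minimal sketch of counting and dropping fully duplicated rows follows. The `toy` data frame is hypothetical and only mimics a few of the columns.

```python
import pandas as pd

# Hypothetical rows; the first two are exact duplicates of each other.
toy = pd.DataFrame({
    "Job Title": ["Data Scientist", "Data Scientist", "Data Analyst"],
    "Company Name": ["Acme", "Acme", "Globex"],
    "avg_salary": [114, 114, 92],
})

# duplicated() flags every row that repeats an earlier row.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, deduplicated.shape)
```

Whether a repeated posting should really be dropped is a judgment call; `drop_duplicates(subset=[...])` restricts the comparison to chosen columns if only some of them define a duplicate.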

#### b. Handling missing values
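
`df.info()` reports no nulls, yet the summaries above show -1 as the minimum of 'company_age' and '-1' as the most frequent 'Industry'. This suggests that -1 is used as a placeholder for missing entries, which is an assumption on my part. Under that assumption, a minimal sketch on hypothetical rows could look like this:

```python
import pandas as pd

# Hypothetical rows (illustration only); -1 and "-1" are assumed to mean "missing".
toy = pd.DataFrame({
    "Industry": ["Insurance Carriers", "-1", "Research & Development", "-1"],
    "company_age": [27, -1, 52, 16],
})

cleaned = toy.copy()
# mask() replaces the entries where the condition holds with NaN.
cleaned["Industry"] = toy["Industry"].mask(toy["Industry"] == "-1")
cleaned["company_age"] = toy["company_age"].mask(toy["company_age"] == -1)

# Count missing entries per column before imputing.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)

# One simple choice: median for the numeric column, a label for the categorical one.
cleaned["company_age"] = cleaned["company_age"].fillna(cleaned["company_age"].median())
cleaned["Industry"] = cleaned["Industry"].fillna("Unknown")
```

Median imputation and the "Unknown" label are just one option; dropping the affected rows is another.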

#### c. Checking categorical and Numerical columns
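
A short sketch of how the columns could be split by type with `select_dtypes`. The `toy` data frame is hypothetical and only reuses a few column names from the data frame above.

```python
import pandas as pd

# Hypothetical rows (illustration only).
toy = pd.DataFrame({
    "Job Title": ["Sr Data Scientist", "Data Scientist"],
    "Rating": [3.1, 4.2],
    "avg_salary": [154, 154],
    "job_simp": ["data scientist", "data scientist"],
})

# Numeric dtypes (int, float) versus everything else.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numerical_cols)
print(categorical_cols)
```

Note that 0/1 flags such as 'python' or 'excel' are stored as integers, so this split would list them as numerical even though they are really binary categories.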

#### d. Handling the Outliers

##### Finding the Outliers
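
One common way to flag outliers is the 1.5 × IQR rule, which is my choice here and not necessarily the method intended for this section. With the quartiles of 'avg_salary' reported by `describe()` above (25% = 103, 75% = 136), the IQR is 33 and the fences are 53.5 and 185.5, so both the minimum (43) and the maximum (271) fall outside. The helper `iqr_bounds` and the `toy_salary` values below are hypothetical.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical values chosen so that the quartiles are 103 and 136.
toy_salary = pd.Series([43, 99, 103, 107, 114, 115, 136, 154, 271], name="avg_salary")

lower, upper = iqr_bounds(toy_salary)
outliers = toy_salary[(toy_salary < lower) | (toy_salary > upper)]
print(lower, upper)
print(outliers.tolist())
```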

##### Uni-variate Analysis (boxplot)

##### Bi-variate Analysis (scatter plot)

##### Deleting the Outliers

    1- Z-score Analysis
    2- 99th percentile  
    3-    
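
The first two methods listed above can be sketched as follows. The threshold of 3 standard deviations is a common convention that I am assuming here, and the `toy_salary` values are invented for illustration.

```python
import pandas as pd

# Hypothetical values: 19 ordinary salaries plus one extreme value.
toy_salary = pd.Series(list(range(100, 119)) + [400], name="avg_salary")

# 1- Z-score: keep values within 3 population standard deviations of the mean.
z_scores = (toy_salary - toy_salary.mean()) / toy_salary.std(ddof=0)
kept_by_zscore = toy_salary[z_scores.abs() <= 3]

# 2- 99th percentile: keep values at or below the 99th percentile.
cap = toy_salary.quantile(0.99)
kept_by_percentile = toy_salary[toy_salary <= cap]

print(len(kept_by_zscore), len(kept_by_percentile))
```

`scipy.stats.zscore` computes the same z-scores (its default is also `ddof=0`). The two rules behave differently: the z-score rule removes nothing when no value is extreme, whereas the percentile cut always removes roughly the top 1% of rows.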
    

## 4- Data transformation  
### 4.1- Log Transformation
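
A minimal sketch of a log transformation with NumPy, applied to a hypothetical right-skewed `toy_salary` series (invented values). I use `np.log1p`, i.e. log(1 + x), as an assumption because it also tolerates zeros; plain `np.log` would work as well for strictly positive salaries.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed salaries (illustration only).
toy_salary = pd.Series([50, 60, 75, 90, 110, 140, 180, 240, 330], name="avg_salary")

# Compress the long right tail.
log_salary = np.log1p(toy_salary)

# expm1 is the exact inverse of log1p, so the original scale can be recovered.
restored = np.expm1(log_salary)

print(f"skewness before: {toy_salary.skew():.2f}, after: {log_salary.skew():.2f}")
```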

### 4.2- Data Normalization 
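
A minimal sketch of min-max normalization with the `MinMaxScaler` imported at the top of the notebook. It rescales each column to the range [0, 1] via (x - min) / (max - min). The `toy` rows are hypothetical and not taken from the CSV.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows (illustration only).
toy = pd.DataFrame({
    "min_salary": [31, 79, 91, 122, 212],
    "company_age": [5, 16, 27, 52, 239],
})

scaler = MinMaxScaler()  # default feature_range is (0, 1)
normalized = pd.DataFrame(
    scaler.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(normalized.round(3))
```

Because the minimum and maximum define the scale, a single extreme value squeezes all other values into a narrow band, which is one reason to deal with outliers first.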

### 4.3- Data Standardisation 
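
A minimal sketch of standardisation with the `StandardScaler` imported at the top of the notebook. Each column is shifted to mean 0 and scaled to unit variance. The `toy` rows are hypothetical and not taken from the CSV.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical rows (illustration only).
toy = pd.DataFrame({
    "min_salary": [31, 79, 91, 122, 212],
    "company_age": [5, 16, 27, 52, 239],
})

scaler = StandardScaler()
standardized = pd.DataFrame(
    scaler.fit_transform(toy), columns=toy.columns, index=toy.index
)

# StandardScaler divides by the population standard deviation (ddof=0),
# so compare with std(ddof=0) rather than the pandas default (ddof=1).
print(standardized.mean().round(6))
print(standardized.std(ddof=0).round(6))
```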