# **DATA PRE-PROCESSING INTUITION**

Python Packages
- pandas - Handle data in a way suited to analysis; Since you are an R user, you will love pandas as it supports dataframes.
- matplotlib - Very good for plotting graphs and figures.
- scikit-learn - For machine learning, look no further.
- numpy - Multi-dimensional arrays and matrices; mathematical functions.
- scipy - Amazing modules for scientific computing.
- beautifulsoup4 - web scraping.
- django - web framework for building web applications in Python. 


# **DATA WRANGLING**  

Data Wrangling: Process of gathering, extracting, cleaning, and storing data.

We need to assess our data to:

- Test assumptions about:  
    - Values that are there  
    - Data types  
    - Shape
- Identify errors or outliers   
- Find missing values  

## DATA EXTRACTION 
Acquiring data often isn't fancy     
Find stuff on the internet!      
A lot of data is stored in text files and on government websites      


### **CSV: Comma-Separated Values**  

Tabular data  
Row: item in a dataset  
Column: fields for the data items  
Cells: Individual values for a field    

**Why CSV formats are used**  
Lightweight  
Each line of text is a single row  
Fields are separated by a delimiter  
Just the data (and delimiters) itself    

Like a spreadsheet with no formulas,    
Easy to process with code (unlike .xlsx)     
All spreadsheet apps read/write CSV  
Don't need special software (e.g., Excel)  
- If the file is big, opening it in a spreadsheet app like Excel can be slow, inefficient, or even impossible. 
- We may also want to process tabular data programmatically: with a lot of files to process, doing it manually in a spreadsheet app isn't practical.   

**Python Code**
#To read a CSV  
import pandas as pd  
baseball_data = pd.read_csv('baseball_data.csv')

#To write to a CSV  
baseball_data.to_csv('baseball_data_with_weight_height.csv')    
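
For files too big to open comfortably in one go, pandas can read a CSV in pieces. The sketch below is only an illustration: the data is a made-up in-memory string standing in for a large file, and the column names are assumptions.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV file on disk.
csv_text = "name,weight\nAaron,180\nBonds,185\nCobb,175\nDiMaggio,193\n"

total_rows = 0
weight_sum = 0
# chunksize makes read_csv yield DataFrames of at most 2 rows each,
# so only one small piece is held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)
    weight_sum += chunk["weight"].sum()

mean_weight = weight_sum / total_rows
print(total_rows, mean_weight)
```

With a real file, replace the `io.StringIO(...)` argument with the file path and pick a much larger `chunksize`.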

## XLRD  
Python package for reading Excel files  

## **XML**   
Resembles HTML    

**XML Design Goals:**   
Data transfer that is platform independent      
Easy to write code to read/write      
Document validation    

**XML Standard:**  
Robust parsers in most languages    
We can focus on our app  
It's also free, as in beer   
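
As a small illustration of how little code a standard parser needs, here is a sketch using Python's built-in `xml.etree.ElementTree`. The document structure and tag names (`players`, `player`, `name`, `weight`) are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; the tags and attributes are made up.
xml_text = """
<players>
  <player id="1"><name>Aaron</name><weight>180</weight></player>
  <player id="2"><name>Bonds</name><weight>185</weight></player>
</players>
"""

root = ET.fromstring(xml_text)

# Turn each <player> element into a plain dictionary.
players = [
    {
        "id": p.get("id"),                    # attribute value (a string)
        "name": p.findtext("name"),           # text of the <name> child
        "weight": int(p.findtext("weight")),  # convert text to a number
    }
    for p in root.findall("player")
]
print(players)
```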

**Best Practices for Scraping**
1. Look at how browser makes requests  
2. Emulate in code  

## **JSON**  
Looks like a Python dictionary with curly braces (value is associated with a key).   
Supports nested structures in a way that csv documents cannot.

Data Modeling in JSON  
Items may have different fields  
May have nested objects  
May have nested arrays  
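
A minimal sketch of these three properties, using a made-up JSON string (the artist records and field names are assumptions for illustration only):

```python
import json

# Hypothetical JSON: the two items have different fields,
# a nested object ("stats") and a nested array ("tags").
json_text = """
[
  {"name": "Artist A", "tags": ["rock", "indie"], "stats": {"listeners": 1200}},
  {"name": "Artist B", "stats": {"listeners": 800, "playcount": 4000}}
]
"""

artists = json.loads(json_text)

for artist in artists:
    # .get() supplies a default when an item lacks a field
    tags = artist.get("tags", [])
    listeners = artist["stats"]["listeners"]  # drill into the nested object
    print(artist["name"], tags, listeners)
```

Using `.get()` with a default is one simple way to cope with items that do not all share the same fields.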

You are most likely to encounter JSON data through a web service. A web service works like a database that you query using HTTP requests.

**Python Code**    

#Imports  
import json  
import requests  

#URL to send the API request to  
url = 'http://www.mylink.com'  

#Make our API call using the requests library and load the results into a dictionary  
data = requests.get(url).text     
data = json.loads(data)  
print(type(data))    

#Print out the name of the no. 1 top artist  
print(data['topartists']['artist'][0]['name'])

---
# DATA CLEANING
### Reality is: much of your time as a data scientist will be spent preparing and ‘cleaning’ your data of: outliers, missing data, malicious data, erroneous data, irrelevant data, inconsistent data and formatting  

### Sources of Dirty Data
User entry errors  
Poorly applied coding standards    
Different Schemas    
Legacy systems    
Evolving applications  
No unique identifiers    
Data migration  
Programmer error  
Corruption in transmission   

### **Sanity Checking Data**  
Does the data make sense?  
Is there a problem?  
Does the data look like I expect it to?  

#Look at mean, min, and max values, and any differences in count  
dataset.describe() 

### Measures of Data Quality  
1. Validity: Conforms to a schema  
2. Accuracy: Conforms to gold standard    
3. Completeness: All records?    
4. Consistency: Matches other data  
5. Uniformity: Same units  

### Blueprint for Cleaning  
- Audit your data    
- Create a data cleaning plan
   - Identify causes
   - Define Operations
   - Test  
- Execute the plan 
- Manually correct  

Iterate until you have confidence in your data 

### **Types of Corrections**  
Removing/correcting typos  
Validating against known entities    
Data enhancement    
Data harmonization (St v Street)   
Changing reference data (USA) 
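
Two of these corrections can be sketched in pandas. The table below is invented, and treating "changing reference data" as mapping every spelling of a country to one reference value is an assumption about what the note means.

```python
import pandas as pd

# Hypothetical address data with inconsistent spellings.
addresses = pd.DataFrame({
    "street": ["12 Main St", "4 Oak Street", "9 Elm St."],
    "country": ["USA", "United States", "U.S.A."],
})

# Data harmonization: expand the abbreviation "St" / "St." to "Street".
# \b keeps the pattern from matching the "St" inside "Street".
addresses["street"] = addresses["street"].str.replace(
    r"\bSt\b\.?", "Street", regex=True
)

# Reference data: map every variant to a single agreed value.
addresses["country"] = addresses["country"].replace(
    {"United States": "USA", "U.S.A.": "USA"}
)
print(addresses)
```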

---

## **MISSING VALUES**  

### **Reasons for Missing Values**
- Occasional failures: occasional system errors or nonresponse prevent data from being recorded        
- Systematic gaps: some subset of subjects or event types is systematically missing certain data attributes, or missing entirely.  

Important point: if the missing values in your data are distributed at random, your data may still be representative of the population. However, if values are missing systematically, it could invalidate your findings. Check your data to see if such effects are present. 

### **Identify Missing Values**  

#Visualize missing values  
import seaborn as sns  
import matplotlib.pyplot as plt  
sns.heatmap(dataset.isnull(),yticklabels = False, cbar = False, cmap = 'viridis')  
plt.show()  

#Count of missing values  
dataset.isnull().sum()    

#Percentage of values missing  
dataset.isnull().sum()/len(dataset)*100  

### **Dealing with Missing Data**    

### **1.Partial Deletion**  
Limiting our data set for analysis to the data we have available to us  
- **Listwise Deletion:** exclude a particular datapoint from **ALL** analysis even if some useful values are present  
- **Pairwise Deletion:** exclude a particular case from the analysis for tasks which are not possible with the data at hand 
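
The difference between the two is easiest to see on a tiny table. This is a sketch with invented numbers: `dropna()` gives listwise deletion, while restricting to the columns a task needs before dropping gives pairwise deletion. `DataFrame.corr()` also excludes missing values pair by pair.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each column is missing a different row.
df = pd.DataFrame({
    "height": [70, 72, np.nan, 68, 75],
    "weight": [180, np.nan, 190, 160, 210],
    "age":    [25, 31, 28, np.nan, 35],
})

# Listwise deletion: a row with ANY missing value is dropped everywhere.
listwise = df.dropna()

# Pairwise deletion: for a height/weight analysis, drop only the rows
# that are missing height or weight; a missing age does not matter here.
height_weight = df[["height", "weight"]].dropna()

# corr() uses all rows available for each pair of columns.
pairwise_corr = df.corr()

print(len(listwise), len(height_weight))
```

Listwise deletion keeps fewer rows here than the pairwise height/weight subset, which is the trade-off: simpler and consistent, but more data thrown away.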

### **2.Imputation**  
Used in scenarios in which we do not have much data, or where removing missing values would compromise the representativeness of our sample.    
Mean imputation and linear regression are two methods that are simple and relatively effective but can have negative side effects (they can obscure or amplify trends).     

**Imputing the mean**  
Drawback: Lessens correlations between variables     

#Impute based on column mean  
baseball['weight'] = baseball['weight'].fillna(baseball['weight'].mean())      
      
#Impute mean based on condition     
def impute_age(cols):    
    Age = cols['Age']    
    Pclass = cols['Pclass']    
    if pd.isnull(Age):    
        if Pclass == 1:    
            return 37    
        elif Pclass == 2:    
            return 29    
        else:    
            return 24    
    else:    
        return Age    
        
#Apply the function to the Age column    
train['Age']=train[['Age','Pclass']].apply(impute_age, axis =1 )    

**Imputing Using Linear Regression**  
Fit a linear model to estimate the missing values  
Drawbacks: Overemphasizes existing trends in the data (imputed values will amplify the trend); exact values suggest too much certainty  
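
A minimal sketch of regression imputation, assuming a single predictor and invented height/weight data. It uses `np.polyfit` for the straight-line fit; a fuller model (for example scikit-learn's `LinearRegression`) would follow the same steps.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last player's weight is missing.
players = pd.DataFrame({
    "height": [68, 70, 72, 74, 76, 71],
    "weight": [160, 170, 180, 190, 200, np.nan],
})

# 1. Fit weight ~ height on the rows where weight is known.
known = players.dropna(subset=["weight"])
slope, intercept = np.polyfit(known["height"], known["weight"], deg=1)

# 2. Predict the missing weights from the fitted line.
missing = players["weight"].isna()
players.loc[missing, "weight"] = slope * players.loc[missing, "height"] + intercept

print(players)
```

The imputed value sits exactly on the fitted line, which illustrates both drawbacks: it reinforces the existing trend and looks more certain than it is.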

### 3. Remove From Dataset Completely  

#Remove rows with a missing value in a specific column  
dataset = dataset[pd.notnull(dataset['Price'])]  

#Remove all rows missing a value in any column  
dataset = dataset.dropna() # set axis to 1 to drop entire columns that have a missing value  

#Remove a specific column with a lot of missing values      
train.drop('Cabin', axis = 1, inplace = True)  

---

## **DEALING WITH OUTLIERS**   

### **Identify Outliers**    

**Kurtosis (Heaviness of Tails)**  
Can be used to check for outliers     
High kurtosis may indicate problems with outliers  
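
A quick way to see this is to compare kurtosis with and without an extreme value. The sketch below uses simulated incomes (the numbers are assumptions) and pandas' `Series.kurt()`, which reports excess kurtosis, so a normal distribution scores close to 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated incomes, roughly normal, with no extreme values.
clean = pd.Series(rng.normal(loc=50_000, scale=10_000, size=1_000))

# The same data plus one hypothetical billionaire.
with_outlier = pd.concat([clean, pd.Series([1_000_000_000])], ignore_index=True)

clean_kurt = clean.kurt()
outlier_kurt = with_outlier.kurt()
print(clean_kurt, outlier_kurt)
```

A single extreme value is enough to push the kurtosis from near zero to a very large number, which is why it works as a warning sign.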

**Standard Deviation**  
Identify outliers by how many standard deviations they lie from the median (or mean); this is more principled than an arbitrary cutoff.    
What multiple? You just have to use common sense    
  
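**Outliers Function Using Standard Deviation**      
import numpy as np    
import matplotlib.pyplot as plt    

#Keep only the values within 2 standard deviations of the median    
def reject_outliers(data):    
    u = np.median(data)    
    s = np.std(data)    
    filtered = [e for e in data if (u - 2 * s < e < u + 2 * s)]    
    return filtered    
#incomes is assumed to be an existing array of income values    
filtered = reject_outliers(incomes)    
plt.hist(filtered, 50)    
plt.show()      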

---
## **CATEGORICAL DATA**   
Most machine learning algorithms require numeric input.  
With qualitative (categorical) variables we can implement two methods:
1. Enumerate  
2. Create dummy variables  

### Enumerate  
The first approach is to look at the distribution of the variable's values and map each category to a number.    

**Replacing Column values in a dataframe**    
adh['Gender'] = adh['Gender'].map({'M': 0, 'F': 1})  

### Dummy Variables
The second approach is to create a dummy (indicator) variable for each possible category.    

#Convert categorical variables into "dummy" or indicator variables     
dSex = pd.get_dummies(train['Sex'], drop_first = True) # drop_first prevents multi-collinearity    
dEmbark = pd.get_dummies(train['Embarked'], drop_first = True)    

#Add new dummy columns to data frame  
train = pd.concat([train,dSex,dEmbark],axis = 1)  
  
#Drop unnecessary columns  
train.drop(['Sex', 'Embarked','Name','Ticket'], axis = 1, inplace = True)    

---

## FEATURE SCALING   
Feature Scaling = changing the range of features  


### **Why Feature Scaling?**  
Many algorithms compute the Euclidean distance between two observations, and if one of the features has a vastly larger range than another, the distance will be dominated by that feature. So for many machine learning algorithms, it's important to scale (normalize or standardize) the data before using it.  

Need to ask yourself, if your model is based on several numerical attributes – are they comparable?  
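
To make the point concrete, here is a sketch with three invented people described by age and income. The value ranges used for rescaling (18 to 80 years, 20,000 to 120,000 dollars) are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical observations: [age in years, income in dollars]
a = np.array([25, 50_000])
b = np.array([60, 51_000])   # very different age, similar income
c = np.array([26, 58_000])   # similar age, somewhat different income

# Raw distances: income dominates because its numbers are much larger.
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)

# Rescale both features to [0, 1] using assumed value ranges.
lo = np.array([18, 20_000])
hi = np.array([80, 120_000])

def min_max(v):
    return (v - lo) / (hi - lo)

scaled_ab = np.linalg.norm(min_max(a) - min_max(b))
scaled_ac = np.linalg.norm(min_max(a) - min_max(c))

print(raw_ab < raw_ac, scaled_ab < scaled_ac)
```

On the raw data, `a` looks closer to `b` even though they differ by 35 years; after scaling, `a` is closer to `c`, because age now carries comparable weight.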

### **Read the docs**  
- Most data mining and machine learning techniques work fine with raw, un-normalized data but double check the one you’re using before you start  
    - Some models are ok with data that is not normalized (regression)  
- But some models may not perform well when different attributes are on very different scales  
    - It can result in some attributes counting more than others  
    - Bias in the attributes can also be a problem    

  
### **Algorithms That Require Feature Scaling**    
- SVM with RBF kernel
- k-Means Clustering  

### **Feature Scaling Not Necessary For:**   
- Decision Trees    
- Linear Regression  

### **Two Specific Feature Scaling Methods:**
1. Normalization  
2. Standardization  

## Normalization    
Making the range of feature values between 0 and 1  
Re-scaling features so that they always span comparable ranges  
Still contain the same info but just expressed in different units
Easiest technique of scaling features but is not so useful for data with outliers    

**Normalization Formula**    
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x'$ is the new rescaled feature.
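
The min-max formula above can be sketched in a few lines of NumPy. The function name and the sample weights are assumptions for illustration; note that the formula breaks down (division by zero) if every value is the same.

```python
import numpy as np

def min_max_normalize(x):
    """Rescale values to the range [0, 1]: (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical weights
weights = [115, 140, 175]
normalized = min_max_normalize(weights)

# The same data with one extreme value: the original three points
# get squeezed into a small part of the [0, 1] range.
squeezed = min_max_normalize(weights + [1000])

print(normalized, squeezed)
```

The second call shows why this method is not so useful with outliers: one extreme value sets the maximum, and everything else ends up close to 0.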

## **Standardizing Numerical Data**   
Variable has been rescaled to have a mean of 0 and standard deviation of 1   
Usually preferred because it is less sensitive to outliers  

**Standardization Formula**    
For each value, subtract the mean and then divide by the standard deviation  
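
In symbols this is $z = (x - \mu) / \sigma$. A minimal NumPy sketch follows; the function name and sample values are assumptions, and it uses the population standard deviation (NumPy's default).

```python
import numpy as np

def standardize(x):
    """Subtract the mean, then divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Hypothetical weights
weights = [115, 140, 175]
z = standardize(weights)

print(z, z.mean(), z.std())
```

The result has a mean of 0 and a standard deviation of 1 (up to floating-point rounding), whatever units the original values were in.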

**Standardization in sklearn**  
Scikit-learn PCA implementation has a “whiten” option that rescales the projected components to unit variance for you. Use it!    
Scikit-learn has a preprocessing module with handy normalize and scale functions    
- Your data may have ‘yes’ and ‘no’ that needs to be converted to 1 and 0    

### Don’t forget to re-scale your results when you’re done (to interpret the results you get)  
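
Undoing standardization just reverses the two steps: multiply by the standard deviation, then add the mean. The sketch below uses invented heights and a made-up model output in scaled units.

```python
import numpy as np

# Hypothetical heights in centimetres
heights = np.array([150.0, 160.0, 170.0, 180.0, 190.0])

# Keep the mean and standard deviation used for scaling
mean, std = heights.mean(), heights.std()
z = (heights - mean) / std

# Suppose a model working in scaled units outputs 0.5 (assumed value).
prediction_scaled = 0.5

# Convert back to centimetres so the result can be interpreted.
prediction_cm = prediction_scaled * std + mean
restored = z * std + mean

print(prediction_cm)
```

The key design point is to store the `mean` and `std` computed at scaling time, since the same values are needed to translate results back.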

---

# **EXPLORATION / DATA VISUALIZATION**  
Build intuition & Find patterns  



### Plots for categorical data
for c in qualitative:  
    train[c] = train[c].astype('category')  
    if train[c].isnull().any():  
        train[c] = train[c].cat.add_categories(['MISSING'])  
        train[c] = train[c].fillna('MISSING')    
def boxplot(x, y, **kwargs):  
    sns.boxplot(x=x, y=y)  
    x=plt.xticks(rotation=90)  
f = pd.melt(train, id_vars=['SalePrice'], value_vars=qualitative)  
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, height=5)  
g = g.map(boxplot, "value", "SalePrice")    


### Time Series Data  
Loess Curve    
Plotted over data points  
Emphasize long term trends rather than year-to-year variability  
Form of weighted regression
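
To show what "weighted regression" means here, this is a bare-bones LOESS sketch written with NumPy only. It is a simplified illustration rather than a reference implementation: the function name, the `frac` parameter, and the simulated yearly data are all assumptions, and it expects distinct x values.

```python
import numpy as np

def loess_smooth(x, y, frac=0.5):
    """At each x, fit a straight line to the nearest points,
    weighting closer points more heavily (tricube weights)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = max(3, int(np.ceil(frac * n)))   # points in each local window
    smoothed = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        h = np.sort(dist)[k - 1]         # distance to the k-th nearest point
        w = np.clip(1 - (dist / h) ** 3, 0, 1) ** 3   # tricube weights
        # np.polyfit weights multiply the residuals before squaring,
        # so pass sqrt(w) to weight the squared errors by w.
        slope, intercept = np.polyfit(x, y, deg=1, w=np.sqrt(w))
        smoothed[i] = slope * x[i] + intercept
    return smoothed

# Simulated yearly series: a steady upward trend plus noise.
years = np.arange(1990, 2020)
rng = np.random.default_rng(0)
values = 0.5 * (years - 1990) + rng.normal(0, 2, size=years.size)

trend = loess_smooth(years, values, frac=0.5)
print(trend[:5])
```

Plotting `trend` over the raw `values` would show the long-term rise with the year-to-year jitter smoothed away; a larger `frac` gives a smoother curve.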