In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


/kaggle/input/titanic/gender_submission.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/train.csv


### Refresher on Pandas basics
```python
hiking = pd.read_json('datasets/hiking.json')  # the file is JSON, so use read_json rather than read_csv
print(hiking.head())
```

Preprocessing is a prerequisite for modeling.

# Chapter 1

### Dealing with missing data
```python
df.isnull().sum()
df.notnull().sum()
# Subset the df dataset
df_subset = df[df['category_desc'].notnull()]
```
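
As a minimal sketch of the calls above, assuming a small made-up DataFrame (the `title` column and all values are hypothetical; only the `category_desc` name comes from the snippet above):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values in category_desc
df = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "category_desc": ["Education", np.nan, "Health", np.nan],
})

# Count the missing values in each column
missing_per_column = df.isnull().sum()

# Keep only the rows where category_desc is present
df_subset = df[df["category_desc"].notnull()]

print(missing_per_column)
print(df_subset.shape)
```

The boolean mask returned by `notnull()` selects rows, so the subset keeps every column but drops the incomplete rows.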

### Working with data types
```python
print(df.dtypes)
```

- object: string/mixed types
- int64: integer
- float64: float
- datetime64 (or timedelta64): datetime

### Converting column types
```python
df['C'] = df['C'].astype('float')
```
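
A small sketch of the conversion, assuming a hypothetical column `C` whose numbers were read in as strings:

```python
import pandas as pd

# Hypothetical column "C": numeric values stored as strings
df = pd.DataFrame({"C": ["1", "2.5", "3"]})
print(df.dtypes)   # C is not numeric yet

# Convert the column to floating point
df["C"] = df["C"].astype("float")
print(df.dtypes)   # C is now float64

total = df["C"].sum()   # arithmetic works once the column is numeric
print(total)
```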

### Class distribution
#### Stratified Sampling
- 100 samples, 80 class 1 and 20 class 2
- Training set: 75 samples, 60 class 1 and 15 class 2
- Test set: 25 samples, 20 class 1 and 5 class 2

```python
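from sklearn.model_selection import train_test_split

y = y['labels']  # y is now a Series of class labels

# stratify=y keeps the class proportions the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
y_train.value_counts()  # y_train is a Series, so there is no column to select
```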
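
The 100-sample example above can be reproduced with a short sketch. The data here is made up (a single hypothetical `feature` column), and `random_state=0` is only there to make the run repeatable:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 80 of class 1 and 20 of class 2
X = pd.DataFrame({"feature": np.arange(100)})
y = pd.Series([1] * 80 + [2] * 20, name="labels")

# The default test_size of 0.25 gives a 75/25 split;
# stratify=y keeps the 80/20 class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

train_counts = y_train.value_counts()
test_counts = y_test.value_counts()
print(train_counts)
print(test_counts)
```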

# Chapter 2

### Standardizing data
1. Many scikit-learn models assume normally distributed data; if your data isn't, you risk biasing your model.
2. Applied to continuous numerical data

### When to standardize: models
1. The model operates in a linear space
2. Dataset features with high variance can bias the model: a high-variance feature can limit the model's ability to learn from the other features
3. Dataset features are continuous and on different scales
4. Linearity assumptions

### Log normalization
- Good idea when a feature has a high variance
- Takes the natural log, using the constant e (about 2.718) --> ln 30 is about 3.4, ln 300 is about 5.7, ln 3000 is about 8.0

```python
df["log_2"].var()
df['log_2'] = np.log(df['log_2'])
```
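
To see the effect on variance, here is a minimal sketch on a hypothetical feature (the column name `wide` and its values are made up) that spans several orders of magnitude:

```python
import numpy as np
import pandas as pd

# Hypothetical feature spanning several orders of magnitude
df = pd.DataFrame({"wide": [30.0, 300.0, 3000.0, 30000.0]})

var_before = df["wide"].var()

# Natural log; the values must be strictly positive
df["wide_log"] = np.log(df["wide"])
var_after = df["wide_log"].var()

print(var_before, var_after)
```

Each tenfold increase in the raw value adds the same constant (ln 10) to the logged value, which is why the spread shrinks so much.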

### Feature scaling
1. Features on different scales
2. Model with linear characteristics
3. Center features around 0 and transform to unit variance
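4. The result has mean 0 and variance 1; it is approximately normal only if the original feature was, because scaling does not change the shape of the distribution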

```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
```
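
What the scaler computes can be sketched by hand with NumPy. The matrix below is hypothetical, and the sketch assumes the population standard deviation (`ddof=0`), which is my understanding of what `StandardScaler` uses:

```python
import numpy as np

# Hypothetical feature matrix: two columns on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Standardize each column: subtract the column mean,
# then divide by the column standard deviation (ddof=0)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(col_means)
print(col_stds)
```

After scaling, both columns are centered on 0 with unit variance, so neither dominates just because of its units.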

# Chapter 3

## Feature engineering
- Creation of new features based on existing features
- Insight into relationships between features
- Extract and expand data
- Dataset-dependent

### Encoding categorical variables
```python
# In pandas
df['feature_enc'] = df['feature'].apply(lambda val: 1 if val == 'y' else 0)

# Scikit-learn 
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['feature_enc'] = le.fit_transform(df['feature'])
```

### One hot encoding
```python
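# get_dummies returns one indicator column per category, so join the result back to df
df = pd.concat([df, pd.get_dummies(df['feature_multi'], prefix='feature_multi')], axis=1)
```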
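
A minimal sketch of one-hot encoding on a hypothetical three-level column (the color values are made up). `pd.get_dummies` returns a DataFrame with one indicator column per category, so the sketch joins it back with `pd.concat`:

```python
import pandas as pd

# Hypothetical categorical column with three levels
df = pd.DataFrame({"feature_multi": ["red", "green", "blue", "green"]})

# One indicator column per category, named with the given prefix
dummies = pd.get_dummies(df["feature_multi"], prefix="feature_multi")

# Attach the indicator columns to the original frame
df_encoded = pd.concat([df, dummies], axis=1)

print(df_encoded.columns.tolist())
```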

### Engineering numerical features
```python
columns = ['feature_1', 'feature_2', 'feature_3']
df['mean'] = df[columns].mean(axis=1)  # row-wise mean of the selected columns
```
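
A quick sketch with made-up values shows the row-wise aggregate:

```python
import pandas as pd

columns = ["feature_1", "feature_2", "feature_3"]

# Hypothetical values for the three features
df = pd.DataFrame({
    "feature_1": [1.0, 4.0],
    "feature_2": [2.0, 5.0],
    "feature_3": [3.0, 6.0],
})

# axis=1 averages across the columns of each row
df["mean"] = df[columns].mean(axis=1)
print(df)
```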

### Dates
```python
df['date_converted'] = pd.to_datetime(df['date'])  # converting to datetime makes extraction much easier
df['month'] = df['date_converted'].dt.month
```
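
For illustration, the same steps on two hypothetical date strings:

```python
import pandas as pd

# Hypothetical dates stored as strings
df = pd.DataFrame({"date": ["2011-07-10", "2012-12-25"]})

# Parse the strings into datetime64 values
df["date_converted"] = pd.to_datetime(df["date"])

# The .dt accessor exposes the date components of the whole column
df["month"] = df["date_converted"].dt.month
df["year"] = df["date_converted"].dt.year

print(df)
```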

### Engineering features from text
- `\d` --> matches a digit, and `+` --> matches as many of them as possible
- `\.` --> matches a literal decimal point

In [2]:
import re
my_string = '75.6 F'
pattern = re.compile(r'\d+\.\d+')  # raw string, so the backslashes reach the regex engine unchanged
temp = re.match(pattern,my_string)
print(float(temp.group(0)))

75.6
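
One thing to keep in mind: `re.match` only matches at the start of the string, while `re.search` scans the whole string. A short sketch with two hypothetical inputs:

```python
import re

pattern = re.compile(r"\d+\.\d+")

starts_with_number = "75.6 F"
number_in_middle = "Temperature: 75.6 F"

# match succeeds only when the pattern matches at position 0
at_start = pattern.match(starts_with_number)
not_at_start = pattern.match(number_in_middle)

# search finds the first occurrence anywhere in the string
found = pattern.search(number_in_middle)

print(at_start.group(0))
print(not_at_start)
print(float(found.group(0)))
```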


### Vectorizing text
- tf = term frequency
- idf = inverse document frequency
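
To make the two terms concrete, here is a hand-rolled sketch on a hypothetical three-document corpus. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is my understanding of the `TfidfVectorizer` defaults; treat that as an assumption rather than a specification.

```python
import math
from collections import Counter

# Hypothetical corpus, already lowercased and split on whitespace
docs = ["mr owen harris", "mrs john bradley", "mr william henry"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return {term: weight} for one document, L2-normalized."""
    counts = Counter(tokens)  # term frequency (raw counts)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf(tokenized[0])
print(weights)
```

The term `mr` appears in two of the three documents, so it gets a lower weight than `owen` or `harris`, which each appear in only one.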

In [3]:
train = pd.read_csv('/kaggle/input/titanic/train.csv')
train.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(train['Name'])
X = text_tfidf.toarray() # Convert the sparse matrix to a dense array, which GaussianNB needs

In [5]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
# Split the dataset according to the class distribution of Survived
y = train.Survived
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

nb = GaussianNB()
# Fit the model to the training data
nb.fit(X_train, y_train)

# Print out the model's accuracy
print(nb.score(X_test, y_test))

0.5112107623318386


### Text Classification
$P(A \ |\ B) = \frac{P(B\ |\ A)\ P(A)}{P(B)}$ --> Naive Bayes. It treats each feature as independent of the others, which can be a naive assumption, but it works well on text data.
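
A worked instance of the formula, with probabilities that are made up purely for illustration (A is a class, B is "the document contains a given word"):

```python
# Hypothetical probabilities, chosen only to illustrate the formula
p_a = 0.4              # P(A): prior probability of the class
p_b_given_a = 0.5      # P(B | A): probability of the word given the class
p_b_given_not_a = 0.1  # P(B | not A): probability of the word otherwise

# P(B) by the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

print(round(p_b, 4))
print(round(p_a_given_b, 4))
```

Seeing the word raises the probability of the class from the prior of 0.4 to roughly 0.77.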

```python
import re

# Extract a decimal number (digits, a point, digits) from a string
def return_mileage(text):
    pattern = re.compile(r"\d+\.\d+")
    
    # re.match only matches at the start of the text
    mile = re.match(pattern, text)
    
    # If a value is returned, use group(0) to return the found value
    if mile is not None:
        return float(mile.group(0))
        
# Apply the function to the text column and take a look at both columns
df["text_extract"] = df['text'].apply(return_mileage)
print(df[["text", "text_extract"]].head())
```

# Chapter 4
## Feature Selection
- Selecting features to be used for modeling
- Doesn't create new features
- Improve model's performance

### Removing redundant features
- Redundant features might create noise when modeling
- Remove noisy features, such as those that already exist in another form as another feature
- Remove correlated features
- Remove duplicated features

### Scenarios for manual removal
- City, state, lat, long --> keeping all of them is noise! Choose based on your end goal: lat and long if you need specific locations, or state for high-level information.
- If you extract a number from text, you can drop the text feature.
- If you took an average to use as an aggregate statistic, you can drop the values that generated it.
- The original features that went through the feature engineering process are usually redundant as well.

### Correlated features
- Statistically correlated: features move together directionally
- Linear models assume feature independence
- Pearson's correlation coefficient
```python
df.corr()
```
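
A minimal sketch of spotting and dropping a correlated feature. The frame is hypothetical: column `b` is built as exactly twice column `a`, so the two are perfectly correlated:

```python
import pandas as pd

# Hypothetical features; b is exactly 2 * a
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

# Pairwise Pearson correlation (the default method of DataFrame.corr)
corr = df.corr()
print(corr)

# a and b carry the same information, so one of them can be dropped
df_reduced = df.drop(columns=["b"])
print(df_reduced.columns.tolist())
```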

### Selecting features using text vectors

#### Looking at word weights
After you've vectorized your text, the vocabulary and weights will be stored in the vectorizer

In [6]:
print(tfidf_vec.vocabulary_) # From titanic !

{'braund': 177, 'mr': 1012, 'owen': 1096, 'harris': 580, 'cumings': 296, 'mrs': 1013, 'john': 709, 'bradley': 174, 'florence': 458, 'briggs': 183, 'thayer': 1367, 'heikkinen': 599, 'miss': 983, 'laina': 786, 'futrelle': 490, 'jacques': 676, 'heath': 595, 'lily': 835, 'may': 941, 'peel': 1120, 'allen': 46, 'william': 1479, 'henry': 610, 'moran': 1000, 'james': 679, 'mccarthy': 944, 'timothy': 1386, 'palsson': 1099, 'master': 928, 'gosta': 529, 'leonard': 821, 'johnson': 710, 'oscar': 1089, 'elisabeth': 395, 'vilhelmina': 1430, 'berg': 138, 'nasser': 1030, 'nicholas': 1046, 'adele': 12, 'achem': 5, 'sandstrom': 1250, 'marguerite': 905, 'rut': 1232, 'bonnell': 165, 'elizabeth': 398, 'saundercock': 1256, 'andersson': 60, 'anders': 57, 'johan': 701, 'vestrom': 1425, 'hulda': 644, 'amanda': 51, 'adolfina': 16, 'hewlett': 613, 'mary': 925, 'kingcome': 764, 'rice': 1191, 'eugene': 431, 'williams': 1480, 'charles': 235, 'vander': 1419, 'planke': 1160, 'julius': 733, 'emelia': 409, 'maria': 907,

In [7]:
print(text_tfidf[3]) 

  (0, 1120)	0.4248477011002001
  (0, 941)	0.400586689697735
  (0, 835)	0.400586689697735
  (0, 595)	0.400586689697735
  (0, 676)	0.3833732281680796
  (0, 490)	0.400586689697735
  (0, 1013)	0.17507317025147795


In [8]:
print( text_tfidf[3].data)

[0.4248477  0.40058669 0.40058669 0.40058669 0.38337323 0.40058669
 0.17507317]


In [9]:
print( text_tfidf[3].indices)

[1120  941  835  595  676  490 1013]


In [10]:
vocab = {v:k for k,v in tfidf_vec.vocabulary_.items()} # Swapping the key value pairs
print(vocab)

{177: 'braund', 1012: 'mr', 1096: 'owen', 580: 'harris', 296: 'cumings', 1013: 'mrs', 709: 'john', 174: 'bradley', 458: 'florence', 183: 'briggs', 1367: 'thayer', 599: 'heikkinen', 983: 'miss', 786: 'laina', 490: 'futrelle', 676: 'jacques', 595: 'heath', 835: 'lily', 941: 'may', 1120: 'peel', 46: 'allen', 1479: 'william', 610: 'henry', 1000: 'moran', 679: 'james', 944: 'mccarthy', 1386: 'timothy', 1099: 'palsson', 928: 'master', 529: 'gosta', 821: 'leonard', 710: 'johnson', 1089: 'oscar', 395: 'elisabeth', 1430: 'vilhelmina', 138: 'berg', 1030: 'nasser', 1046: 'nicholas', 12: 'adele', 5: 'achem', 1250: 'sandstrom', 905: 'marguerite', 1232: 'rut', 165: 'bonnell', 398: 'elizabeth', 1256: 'saundercock', 60: 'andersson', 57: 'anders', 701: 'johan', 1425: 'vestrom', 644: 'hulda', 51: 'amanda', 16: 'adolfina', 613: 'hewlett', 925: 'mary', 764: 'kingcome', 1191: 'rice', 431: 'eugene', 1480: 'williams', 235: 'charles', 1419: 'vander', 1160: 'planke', 733: 'julius', 409: 'emelia', 907: 'maria',

In [11]:
zipped_row = dict(zip(text_tfidf[3].indices, text_tfidf[3].data))
print(zipped_row)

{1120: 0.4248477011002001, 941: 0.400586689697735, 835: 0.400586689697735, 595: 0.400586689697735, 676: 0.3833732281680796, 490: 0.400586689697735, 1013: 0.17507317025147795}


In [12]:
def return_weights(vocab,vector,vector_index):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    return {vocab[i]:zipped[i] for i in vector[vector_index].indices}

print(return_weights(vocab,text_tfidf, 3))

{'peel': 0.4248477011002001, 'may': 0.400586689697735, 'lily': 0.400586689697735, 'heath': 0.400586689697735, 'jacques': 0.3833732281680796, 'futrelle': 0.400586689697735, 'mrs': 0.17507317025147795}


In [15]:
# Add in the rest of the parameters
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    # Let's transform that zipped dict into a series
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Let's sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]

# Print out the weighted words
print(return_weights(vocab, tfidf_vec.vocabulary_, text_tfidf, 8, 3))

[1430, 138, 1089]


In [28]:
def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
    
        # Call return_weights (defined above) for each row and extend the list we're creating
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
    # Return the list in a set, so we don't get duplicate word indices
    return set(filter_list)

# Call the function to get the list of word indices
filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, 3)

# By converting filtered_words back to a list, we can use it to filter the columns in the text vector
filtered_text = text_tfidf[:, list(filtered_words)]

### Dimensionality reduction
- Unsupervised learning method
- Combines/decomposes a feature space
- Feature extraction: here we'll use it to reduce our feature space

### Principal Component Analysis
- Linear transformation to uncorrelated space
- Captures as much variance as possible in each component

In [41]:
train.columns
train_num = train[['Pclass','Age','SibSp','Fare','Parch']]

In [43]:
from sklearn.decomposition import PCA
pca = PCA()
# Of these columns only Age has missing values; fill them with the Age median
df_pca = pca.fit_transform(train_num.fillna(train_num.Age.median()))

By default, PCA in scikit-learn keeps the number of components equal to the number of input features.

In [49]:
print(pca.explained_variance_ratio_) # We could drop those components that don't explain much variance

[9.35603583e-01 6.35890026e-02 4.75409826e-04 1.75562765e-04
 1.56442066e-04]
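
One way to decide how many components to keep is to look at the cumulative explained variance. The sketch below copies the ratios printed above; the 99% threshold is a hypothetical choice, not a rule:

```python
import numpy as np

# Ratios copied from the output above
explained = np.array([9.35603583e-01, 6.35890026e-02, 4.75409826e-04,
                      1.75562765e-04, 1.56442066e-04])

# Running total of the variance explained by the first k components
cumulative = np.cumsum(explained)

# Hypothetical threshold: keep enough components to explain 99% of the variance
threshold = 0.99
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative)
print(n_keep)
```

Here the first two components already pass the threshold. Since the features were not standardized before PCA, the dominance of the first component may partly reflect one feature with a much larger scale (possibly `Fare`), so scaling first is worth considering.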


### PCA caveats
- Components are difficult to interpret; PCA is more of a black-box method than other dimensionality reduction methods
- Apply PCA at the end of your preprocessing journey
- Processing the data further after PCA does not help explain more variance