<a href="https://colab.research.google.com/github/Rohanrathod7/my-ml-labs/blob/main/11_Preprocessing_for_Machine_Learning_in_Python/05_Putting_It_All_Together.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 5. Putting It All Together

In [35]:

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import datetime as dt
# Metrics, models, and model-selection utilities from scikit-learn
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.linear_model import Ridge, Lasso, LogisticRegression, LinearRegression, SGDClassifier
from scipy.cluster.hierarchy import linkage, dendrogram

url = "https://raw.githubusercontent.com/Rohanrathod7/my-ml-labs/main/11_Preprocessing_for_Machine_Learning_in_Python/dataset/ufo_updated.csv"
# Read the UFO sightings CSV file into a DataFrame
ufo = pd.read_csv(url)
display(ufo.head())

url = "https://raw.githubusercontent.com/Rohanrathod7/my-ml-labs/main/10_Dimensionality_Reduction_in_Python/dataset/height_df.csv"
# Read the height measurements CSV file into a DataFrame
height_df = pd.read_csv(url)
display(height_df.head())

Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,...,light,other,oval,rectangle,sphere,teardrop,triangle,unknown,month,year
0,2002-11-21 05:45:00,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.021389,...,0,0,0,0,0,0,1,0,11,2002
1,2012-06-16 23:00:00,san diego,ca,us,light,600.0,10 minutes,Dancing lights that would fly around and then ...,7/4/2012,32.715278,...,1,0,0,0,0,0,0,0,6,2012
2,2013-06-09 00:00:00,oakville (canada),on,ca,light,120.0,2 minutes,Brilliant orange light or chinese lantern at o...,7/3/2013,43.433333,...,1,0,0,0,0,0,0,0,6,2013
3,2013-04-26 23:27:00,lacey,wa,us,light,120.0,2 minutes,Bright red light moving north to north west fr...,5/15/2013,47.034444,...,1,0,0,0,0,0,0,0,4,2013
4,2013-09-13 20:30:00,ben avon,pa,us,sphere,300.0,5 minutes,North-east moving south-west. First 7 or so li...,9/30/2013,40.508056,...,0,0,0,0,1,0,0,0,9,2013


Unnamed: 0,weight_kg,height_1,height_2,height_3,height
0,81.5,1.78,1.8,1.8,1.793333
1,72.6,1.7,1.7,1.69,1.696667
2,92.9,1.74,1.75,1.73,1.74
3,79.4,1.66,1.68,1.67,1.67
4,94.6,1.91,1.93,1.9,1.913333


***Checking column types***  
Take a look at the UFO dataset's column types using the .info() method. Two columns are candidates for transformation: the seconds column, which the exercise describes as a numeric column read in as object (in the output below it already loads as float64, so the cast changes nothing), and the date column, which is read in as object and can be converted to the datetime type. That will make our feature engineering efforts easier later on.

In [36]:
# Print the DataFrame info
print(ufo.info())

# Change the type of seconds to float
ufo["seconds"] = ufo["seconds"].astype("float")

# Change the date column to type datetime
ufo["date"] = pd.to_datetime(ufo["date"])

# Check the column types
print(ufo.info())

# Transforming the column types makes feature engineering and standardization much easier.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1866 entries, 0 to 1865
Data columns (total 37 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   date            1866 non-null   object 
 1   city            1866 non-null   object 
 2   state           1866 non-null   object 
 3   country         1718 non-null   object 
 4   type            1866 non-null   object 
 5   seconds         1866 non-null   float64
 6   length_of_time  1866 non-null   object 
 7   desc            1866 non-null   object 
 8   recorded        1866 non-null   object 
 9   lat             1866 non-null   float64
 10  long            1866 non-null   float64
 11  minutes         1725 non-null   float64
 12  seconds_log     1866 non-null   float64
 13  country_enc     1866 non-null   int64  
 14  changing        1866 non-null   int64  
 15  chevron         1866 non-null   int64  
 16  cigar           1866 non-null   int64  
 17  circle          1866 non-null   i
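
The same two conversions can be tried on a small frame. This is a minimal sketch: the frame `toy` and its values are made up for illustration and are not read from ufo_updated.csv.

```python
import pandas as pd

# Toy frame (illustrative values) with both columns stored as strings
toy = pd.DataFrame({
    "seconds": ["300.0", "600.0", "120.0"],
    "date": ["2002-11-21 05:45:00", "2012-06-16 23:00:00", "2013-06-09 00:00:00"],
})

# Cast the numeric strings to float and parse the timestamps
toy["seconds"] = toy["seconds"].astype("float")
toy["date"] = pd.to_datetime(toy["date"])

# Both columns now have numeric/datetime dtypes instead of strings
print(toy.dtypes)
```

Once date has a datetime dtype, the .dt accessor used later in the notebook becomes available.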

**Dropping missing data**  
In this exercise, you'll remove some of the rows where certain columns have missing values. You're going to look at the length_of_time column, the state column, and the type column. You'll drop any row that contains a missing value in at least one of these three columns.

In [37]:
# Count the missing values in the length_of_time, state, and type columns, in that order
print(ufo[["length_of_time", "state",  "type"]].isna().sum())

# Drop rows where length_of_time, state, or type are missing
ufo_no_missing = ufo.dropna(subset=["length_of_time", "state",  "type"])

# Print out the shape of the new dataset
print(ufo_no_missing.shape)

length_of_time    0
state             0
type              0
dtype: int64
(1866, 37)
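
Since none of the three columns has missing values here, the shape is unchanged. To see the subset argument actually remove rows, here is a small sketch on a made-up frame (`toy` and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values) with gaps in different columns
toy = pd.DataFrame({
    "length_of_time": ["5 minutes", np.nan, "2 minutes", "1 hour"],
    "state": ["nc", "ca", np.nan, "wa"],
    "type": ["triangle", "light", "light", "sphere"],
    "country": ["us", "us", np.nan, np.nan],
})

# A row is dropped only if one of the columns listed in subset is missing;
# the missing country in the last row is ignored
kept = toy.dropna(subset=["length_of_time", "state", "type"])
print(kept.shape)
```

Rows 1 and 2 are removed because of their missing length_of_time and state values, while the last row survives despite its missing country.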


### Categorical Variable and Standardization

**Extracting numbers from strings**  
The length_of_time field in the UFO dataset is a text field that has the number of minutes within the string. Here, you'll extract that number from that text field using regular expressions.

In [38]:
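import re

def return_minutes(time_string):
    # Coerce to str so that missing values (NaN) do not break re.search
    time_string = str(time_string)
    # Use \d+ to grab the first run of digits (the raw string avoids an invalid-escape warning)
    num = re.search(r"\d+", time_string)
    if num is not None:
        return int(num.group(0))
    # No digits found: return None, which pandas stores as NaN
    return None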


# Apply the extraction to the length_of_time column
ufo["minutes"] = ufo["length_of_time"].apply(return_minutes)

# Take a look at the head of both of the columns
print(ufo[["length_of_time", "minutes"]].head())

# The minutes information is now in a numeric form that can be fed into a model.

    length_of_time  minutes
0  about 5 minutes      5.0
1       10 minutes     10.0
2        2 minutes      2.0
3        2 minutes      2.0
4        5 minutes      5.0
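
It helps to check how the pattern behaves on strings that do not follow the "N minutes" shape. The sketch below uses a hypothetical helper, `first_number`, and made-up sample strings; it is not part of the notebook.

```python
import re

def first_number(time_string):
    """Return the first run of digits in the text as an int, or None."""
    match = re.search(r"\d+", str(time_string))
    return int(match.group(0)) if match else None

# Illustrative inputs: a plain case, a string without digits, and a range in hours
samples = ["about 5 minutes", "10 minutes", "a few seconds", "1-2 hours"]
results = [first_number(s) for s in samples]
print(results)
```

The pattern ignores units and takes only the first number, so "1-2 hours" yields 1 and "a few seconds" yields None. That is one reason the extracted minutes column should be treated as a rough feature.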


**Identifying features for standardization**  
In this exercise, you'll investigate the variance of columns in the UFO dataset to determine which features should be standardized. After taking a look at the variances of the seconds and minutes columns, you'll see that the variance of the seconds column is extremely high. Because seconds and minutes are related to each other (an issue we'll deal with when we select features for modeling), let's log normalize the seconds column.

In [39]:
# Check the variance of the seconds and minutes columns
print(ufo[["seconds", "minutes"]].var())

# Log normalize the seconds column
ufo["seconds_log"] = np.log(ufo["seconds"])

# Print out the variance of just the seconds_log column
print(ufo["seconds_log"].var())

seconds    424087.417474
minutes       117.907176
dtype: float64
1.122392388118297
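
The effect of the log transform on variance can be reproduced on a handful of made-up durations. This is a sketch with illustrative values, and the np.log1p line is an optional alternative that the notebook does not use:

```python
import numpy as np
import pandas as pd

# Illustrative durations in seconds, spanning several orders of magnitude
seconds = pd.Series([2.0, 60.0, 300.0, 600.0, 7200.0])

raw_var = seconds.var()
log_var = np.log(seconds).var()
print(raw_var, log_var)

# np.log(0) is -inf, so if a column contained zeros, np.log1p (log of 1 + x)
# is one common way to keep the result finite
zero_safe = np.log1p(pd.Series([0.0, 9.0]))
print(zero_safe.tolist())
```

The log compresses the large values far more than the small ones, which is why the variance drops so sharply.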


### Engineering new features

**Encoding categorical variables**  
There are a couple of columns in the UFO dataset that need to be encoded before they can be modeled through scikit-learn. You'll do that transformation here, using both binary and one-hot encoding methods.

In [40]:
# Use pandas to encode us values as 1 and others as 0
ufo["country_enc"] = ufo["country"].apply(lambda x:1 if x=="us" else 0)

# Print the number of unique type values
print(len(ufo["type"].unique()))

# Create a one-hot encoded set of the type values
type_set = pd.get_dummies(ufo["type"])

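# Concatenate this set back to the ufo DataFrame
# Note: ufo_updated.csv already contains these indicator columns, so this concat duplicates their names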
ufo = pd.concat([ufo, type_set], axis=1)

21
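
Both encodings are easier to inspect on a tiny frame. The sketch below uses made-up rows (the frame `toy` is an assumption, not data from the UFO file) and a vectorized comparison in place of the lambda:

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values), including a missing country
toy = pd.DataFrame({
    "country": ["us", "ca", np.nan, "us"],
    "type": ["light", "triangle", "light", "sphere"],
})

# Binary encoding: anything that is not exactly "us", including NaN, becomes 0
toy["country_enc"] = (toy["country"] == "us").astype(int)

# One-hot encoding: one indicator column per distinct type value
type_set = pd.get_dummies(toy["type"])

# Attach the indicator columns to the original frame
toy = pd.concat([toy, type_set], axis=1)
print(toy.columns.tolist())
```

Note that a missing country is encoded as 0, the same as any non-"us" value, so the binary column does not distinguish "ca" from unknown.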


**Features from dates**  
Another feature engineering task to perform is month and year extraction. Perform this task on the date column of the ufo dataset.

In [41]:
# Look at the first 5 rows of the date column
print(ufo["date"].head())

# Extract the month from the date column
ufo["month"] = ufo["date"].dt.month

# Extract the year from the date column
ufo["year"] = ufo["date"].dt.year

# Take a look at the head of all three columns
print(ufo[["date", "month", "year"]].head())

0   2002-11-21 05:45:00
1   2012-06-16 23:00:00
2   2013-06-09 00:00:00
3   2013-04-26 23:27:00
4   2013-09-13 20:30:00
Name: date, dtype: datetime64[ns]
                 date  month  year
0 2002-11-21 05:45:00     11  2002
1 2012-06-16 23:00:00      6  2012
2 2013-06-09 00:00:00      6  2013
3 2013-04-26 23:27:00      4  2013
4 2013-09-13 20:30:00      9  2013
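
The .dt accessor exposes other components in the same way. A minimal sketch on two timestamps taken from the output above; the hour column is an extra illustration that the notebook does not create:

```python
import pandas as pd

# Two timestamps from the printed head of the date column
dates = pd.Series(pd.to_datetime(["2002-11-21 05:45:00", "2013-04-26 23:27:00"]))

# Each .dt attribute returns a Series aligned with the original dates
parts = pd.DataFrame({
    "month": dates.dt.month,
    "year": dates.dt.year,
    "hour": dates.dt.hour,  # extra example attribute, not used in the notebook
})
print(parts)
```

The accessor only works on a datetime column, which is why date was converted with pd.to_datetime earlier.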


**Text vectorization**  
You'll now transform the desc column in the UFO dataset into tf/idf vectors, since there's likely something we can learn from this field.

In [42]:
# Take a look at the head of the desc field
print(ufo["desc"].head())

# Instantiate the tfidf vectorizer object
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()

# Fill missing values in 'desc' with empty strings
ufo["desc"] = ufo["desc"].fillna("")

# Fit and transform desc using vec
desc_tfidf = vec.fit_transform(ufo["desc"])

# Look at the number of columns and rows
print(desc_tfidf.shape)

# You'll notice that the text vector has a large number of columns. We'll work on selecting the features we want to use for modeling in the next section.

0    It was a large&#44 triangular shaped flying ob...
1    Dancing lights that would fly around and then ...
2    Brilliant orange light or chinese lantern at o...
3    Bright red light moving north to north west fr...
4    North-east moving south-west. First 7 or so li...
Name: desc, dtype: object
(1866, 3422)
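
To see where the column count comes from, here is a sketch on three made-up descriptions (the list `docs` is an assumption for illustration). Each distinct token in the corpus becomes one column:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three short, made-up descriptions
docs = [
    "bright red light moving north",
    "orange light over the lake",
    "large triangular object",
]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)

# Rows are documents, columns are distinct tokens across the whole corpus
print(tfidf.shape)
print(sorted(vec.vocabulary_))
```

The result is a sparse matrix, so most entries are zero. With 1866 real descriptions the vocabulary grows to thousands of columns, as the shape above shows.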


### Feature selection and modeling

**Selecting the ideal dataset**  
Now to get rid of some of the unnecessary features in the ufo dataset. Because the country column has been encoded as country_enc, you can select it and drop the other columns related to location: city, country, lat, long, and state.

You've engineered the month and year columns, so you no longer need the date or recorded columns. You also log-normalized the seconds column as seconds_log, so you can drop seconds and minutes.

You vectorized desc, so it can be removed. For now you'll keep type.

You can also get rid of the length_of_time column, which is unnecessary after extracting minutes.

In [43]:
# Make a list of features to drop
to_drop = ["city", "country", "date", "desc", "lat", "length_of_time", "long", "minutes", "recorded", "seconds", "state"]

# Drop those features
ufo_dropped = ufo.drop(to_drop, axis=1)

# Let's also filter some words out of the text vector we created
# filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, 4)
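
The words_to_filter helper referenced in the commented-out line is not defined in this notebook. As a rough idea of what such a filter could do, the sketch below keeps the columns of the top-weighted words in each document. The function `top_word_indices`, its signature, and the sample documents are all hypothetical stand-ins, not the course's implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_word_indices(tfidf_matrix, top_n):
    """Hypothetical helper: column indices of the top_n highest-weighted words per row."""
    csr = tfidf_matrix.tocsr()
    keep = set()
    for i in range(csr.shape[0]):
        # Slice out the stored weights and column indices of row i
        start, end = csr.indptr[i], csr.indptr[i + 1]
        data = csr.data[start:end]
        cols = csr.indices[start:end]
        # Positions of the largest weights in this row
        order = np.argsort(data)[::-1][:top_n]
        keep.update(cols[order].tolist())
    return sorted(keep)

# Made-up documents for illustration
docs = [
    "bright red light moving north",
    "orange light over the lake",
    "large triangular object",
]
tfidf = TfidfVectorizer().fit_transform(docs)

# Keep only the columns that rank in the top 2 for at least one document
cols = top_word_indices(tfidf, top_n=2)
filtered = tfidf[:, cols]
print(tfidf.shape, filtered.shape)
```

The filtered matrix has the same rows but fewer columns, which is the shape a downstream model would receive in place of the full text vector.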


**Modeling the UFO dataset, part 1**  
In this exercise, you're going to build a k-nearest neighbor model to predict which country the UFO sighting took place in. The X dataset contains the log-normalized seconds column, the one-hot encoded type columns, as well as the month and year when the sighting took place. The y labels are the encoded country column, where 1 is "us" and 0 is everything else: "ca" in the exercise, and here also the rows whose country is missing.

In [44]:
# Define the features (X) and target (y)
X = ufo[['seconds_log', 'changing', 'chevron', 'cigar', 'circle', 'cone', 'cross', 'cylinder', 'diamond', 'disk', 'egg', 'fireball', 'flash', 'formation', 'light', 'other', 'oval', 'rectangle',
           'sphere', 'teardrop', 'triangle', 'unknown', 'month', 'year']]
y = ufo['country_enc']

# Take a look at the features in the X set of data
print(X.columns)

# Split the X and y sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Instantiate the KNeighborsClassifier
knn = KNeighborsClassifier()

# Fit knn to the training sets
knn.fit(X_train, y_train)

# Print the score of knn on the test sets
print(knn.score(X_test, y_test))

Index(['seconds_log', 'changing', 'changing', 'chevron', 'chevron', 'cigar',
       'cigar', 'circle', 'circle', 'cone', 'cone', 'cross', 'cross',
       'cylinder', 'cylinder', 'diamond', 'diamond', 'disk', 'disk', 'egg',
       'egg', 'fireball', 'fireball', 'flash', 'flash', 'formation',
       'formation', 'light', 'light', 'other', 'other', 'oval', 'oval',
       'rectangle', 'rectangle', 'sphere', 'sphere', 'teardrop', 'teardrop',
       'triangle', 'triangle', 'unknown', 'unknown', 'month', 'year'],
      dtype='object')
0.8693790149892934
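
The printed index lists each type column twice, so every indicator enters the model two times. For a distance-based model such as k-nearest neighbors, a repeated column contributes twice to the distance. One possible way to drop the repeats before fitting, sketched on a made-up frame (`X_toy` is an assumption for illustration):

```python
import pandas as pd

# Toy frame with a repeated column name, as produced by concatenating
# indicator columns that already exist in the frame
X_toy = pd.DataFrame(
    [[1.2, 0, 0, 1], [0.7, 1, 1, 0]],
    columns=["seconds_log", "light", "light", "sphere"],
)

# columns.duplicated() marks every repeat after the first occurrence
X_unique = X_toy.loc[:, ~X_toy.columns.duplicated()]
print(X_unique.columns.tolist())
```

Applying the same mask to the real feature frame would change the fitted model, so the accuracy printed above would not necessarily be reproduced.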


**Modeling the UFO dataset, part 2**  
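Finally, you'll build a model using the text vector we created, desc_tfidf. The exercise uses the filtered_words list to create a filtered text vector; because that list is not built in this notebook, the full vector is used instead. The goal is to see if you can predict the type of the sighting based on the text, using a Naive Bayes model. Note that the cell below reuses y from part 1, so it actually scores predictions of country_enc rather than type.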

In [47]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

# Use the list of filtered words we created to filter the text vector
# filtered_text = desc_tfidf[:, list(filtered_words)]

# Use the entire text vector as the feature set
filtered_text = desc_tfidf

# Split the X and y sets using train_test_split, setting stratify=y
# y is still country_enc from the previous modeling step
X_train, X_test, y_train, y_test = train_test_split(filtered_text, y, stratify=y, random_state=42)

# Instantiate the MultinomialNB model
nb = MultinomialNB()

# Fit nb to the training sets
nb.fit(X_train, y_train)

# Print the score of nb on the test sets
print(nb.score(X_test, y_test))

# You've completed the course! Because y is still country_enc, the score below measures
# country prediction from text, not type prediction. Predicting type would need iteration
# to figure out what subset of text improves the model, and whether any of the other
# features are useful in predicting type.

0.880085653104925
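
To predict type from text, as the exercise describes, the target would need to be the type column rather than country_enc. A minimal sketch on made-up descriptions and labels follows; the lists `texts` and `labels` are assumptions, and in the notebook the target would presumably be ufo["type"].

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up descriptions with their sighting type as the label
texts = [
    "bright light in the sky", "flashing light hovering",
    "light moving fast", "large triangle shaped craft",
    "black triangle with three corners", "silent triangle overhead",
]
labels = ["light", "light", "light", "triangle", "triangle", "triangle"]

# Vectorize the text, then fit Naive Bayes on the tf/idf weights
vec = TfidfVectorizer()
X_text = vec.fit_transform(texts)
nb_type = MultinomialNB().fit(X_text, labels)

# New text must go through the already-fitted vectorizer (transform, not fit_transform)
pred = nb_type.predict(vec.transform(["a triangle craft"]))
print(pred)
```

With 21 distinct type values and short free-text descriptions, a real run is likely to be much harder than this two-class toy case.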
