* We will investigate UFO data over the last century to gain some insight.
* Please use all the techniques we have learned in class to preprocess and clean the dataset <p style="color:blue"><b>ufo_sightings_large.csv</b></p>.
* After the dataset is preprocessed, split it into training and test sets.
* Fit KNN to the training sets.
* Print the score of KNN on the test sets.

## 1. Import dataset "ufo_sightings_large.csv" in pandas 

In [3]:
import pandas as pd
ufo=pd.read_csv("ufo_sightings_large.csv")
# Print the shape of the Dataset
print("==========================")
print("Shape of the ufo_sightings")
print("==========================")
print(ufo.shape)

Shape of the ufo_sightings
(4935, 11)


## 2. Checking column types & Converting Column types
Take a look at the UFO dataset's column types using the dtypes attribute, then convert the columns to their proper types.
For example, the date column can be converted to the datetime type.
That will make our feature engineering efforts easier later on.

In [4]:
# We use the pandas to_datetime() method to convert the date column to the datetime type
# Similarly we can convert using the .astype()

print("=================================")
print("Before changing the column types")
print("=================================")
# Check the column types
print(ufo.dtypes)

# Change the date column to type datetime
ufo["date"] = pd.to_datetime(ufo["date"])

# Check the column types
# print(ufo[["seconds","date"]].dtypes)
print("=================================")
print("After changing the column types")
print("=================================")
# Check the column types
print(ufo.dtypes)

Before changing the column types
date               object
city               object
state              object
country            object
type               object
seconds           float64
length_of_time     object
desc               object
recorded           object
lat                object
long              float64
dtype: object
After changing the column types
date              datetime64[ns]
city                      object
state                     object
country                   object
type                      object
seconds                  float64
length_of_time            object
desc                      object
recorded                  object
lat                       object
long                     float64
dtype: object
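
The dtypes output above shows that `lat` is still stored as `object` after the conversion step. A minimal sketch of one way to coerce such a column to a numeric type follows. The toy frame and its malformed value are invented for illustration; why `lat` is read as `object` in the real file is not shown in this notebook.

```python
import pandas as pd

# Toy stand-in for the UFO data; "not_a_number" is a made-up malformed entry
demo = pd.DataFrame({"lat": ["44.9530556", "41.4994444", "not_a_number"]})

# errors="coerce" turns strings that cannot be parsed into NaN instead of raising
demo["lat"] = pd.to_numeric(demo["lat"], errors="coerce")

print(demo.dtypes)
print("Unparseable values:", demo["lat"].isna().sum())
```

Any values coerced to NaN this way would then be handled by the missing-data step that follows.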


## 3. Dropping missing data
Let's remove some of the rows where certain columns have missing values. 

In [5]:
# Dropping using the dropna()
ufo_no_missing=ufo.dropna()
print("====================================================")
print("Shape of the Dataset after removing the missing data")
print("====================================================")
print(ufo_no_missing.shape)

Shape of the Dataset after removing the missing data
(3891, 11)
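
Calling `dropna()` with no arguments removes a row if any column is missing. When only certain columns matter, the `subset` parameter restricts the check to those columns. The sketch below uses an invented three-row frame to show the difference; it is an illustration, not the notebook's actual data.

```python
import pandas as pd

# Invented rows: row 1 lacks length_of_time, row 2 lacks state and desc
demo = pd.DataFrame({
    "length_of_time": ["2 weeks", None, "10 minutes"],
    "state": ["wi", "oh", None],
    "desc": ["red lights", "jets", None],
})

# Drop a row if ANY column is missing
dropped_any = demo.dropna()

# Drop a row only if length_of_time is missing
dropped_subset = demo.dropna(subset=["length_of_time"])

print(dropped_any.shape)
print(dropped_subset.shape)
```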


In [6]:
# Dropping missing data

# Check whether each column contains null values using isnull() and count them using sum()
# print("==================================================================================")
# print("Number of data missing in columns length_of_time, state, type, country, desc, city")
# print("==================================================================================")
#print(ufo[["length_of_time", "state", "type","country","desc","city"]].isnull().sum())

# Keeping the rows which are not null
# ufo_no_missing = ufo[ufo["length_of_time"].notnull() & 
        # ufo["state"].notnull() &
        # ufo["type"].notnull()&ufo["country"].notnull()&ufo["desc"].notnull()&ufo["city"].notnull()]

# print("====================================================")
# print("Shape of the Dataset after removing the missing data")
# print("====================================================")
# Print out the shape of the new dataset
# print(ufo_no_missing.shape)
# Print the columns of the new dataset
# print("=============================")
# print("Columns of the new data frame")
# print("=============================")
# print(ufo_no_missing.columns)

In [7]:
# Remove rows where seconds is 0.0, since they would distort the variance and the score
ufo_no_missing = ufo_no_missing[ufo_no_missing["seconds"] != 0]
print("===================================")
print("Seconds column after removing zeros")
print("===================================")
print(ufo_no_missing["seconds"].head())

Seconds column after removing zeros
0    1209600.0
1         30.0
3        300.0
5        600.0
6        600.0
Name: seconds, dtype: float64


## 4. Extracting numbers from strings
The <b>length_of_time</b> column in the UFO dataset is a text field that has the number of 
minutes within the string. 
Here, you'll extract that number from that text field using regular expressions.

In [8]:
# The "length_of_time" column contains durations given in weeks, hours, or seconds, as well as irrelevant text.
# We extract only the entries that contain the word "minutes" or "minute". We do not consider
# entries that use min, mi, or an h:mm pattern

# Copying the data set into ufo_copy instead of modifying the original data

ufo_copy=ufo_no_missing.copy()

# Extracting the data with the word minutes or minute (raw string so that \d is not treated as a string escape)
ufo_copy["extracted_minutes"]=ufo_copy["length_of_time"].str.extract(r"(.*\d+?.minutes|\d.?minute)",expand=False)

# Printing the shape of the copied Dataset
print("====================")
print("Shape of new Dataset")
print("====================")
print(ufo_copy.shape)

# Looking into Dataset by using the head()
print("=============================================================================")
print("New Dataset")
print("=============================================================================")
print(ufo_copy.head())

Shape of new Dataset
(3723, 12)
New Dataset
                 date       city state country      type    seconds  \
0 2011-11-03 19:21:00  woodville    wi      us   unknown  1209600.0   
1 2004-10-03 19:05:00  cleveland    oh      us    circle       30.0   
3 2002-11-21 05:45:00   clemmons    nc      us  triangle      300.0   
5 2012-06-16 23:00:00  san diego    ca      us     light      600.0   
6 2009-07-12 21:30:00     duluth    mn      us      oval      600.0   

              length_of_time  \
0                    2 weeks   
1                     30sec.   
3            about 5 minutes   
5                 10 minutes   
6  total? maybe around 10 mi   

                                                desc  recorded         lat  \
0  Red blinking objects similar to airplanes or s...  12/12/11  44.9530556   
1               Many fighter jets flying towards UFO  10/27/04  41.4994444   
3  It was a large&#44 triangular shaped flying ob...  12/23/02  36.0213889   
5  Dancing lights that w

In [9]:
# The length_of_time field in the UFO dataset is a text field that contains the number of minutes within the string.
# Here, you'll extract that number from the text using regular expressions.
# In case you want all the numbers from the "length_of_time" column, comment out the section above
# Extracting the number from the "extracted_minutes" column
import re

# re.match() extracts the number only if it occurs at the start of the string, so values
# such as "about 5 minutes" return None
# re.search() would extract the number from any position in the string

# Defining the function return_minutes
def return_minutes(time_string):
    pattern = re.compile(r"\d+")
    num = re.match(pattern, str(time_string))
    if num is not None:
        return int(num.group(0))
# Apply the extraction to the extracted_minutes column
ufo_copy["minutes"] = ufo_copy["extracted_minutes"].apply(return_minutes)
# Take a look at the head of both of the columns
print("===========================================================")
print("Extracted_minutes and minutes columns from ufo_copy Dataset")
print("===========================================================")
print(ufo_copy[["extracted_minutes","minutes"]].head())

Extracted_minutes and minutes columns from ufo_copy Dataset
  extracted_minutes  minutes
0               NaN      NaN
1               NaN      NaN
3   about 5 minutes      NaN
5        10 minutes     10.0
6               NaN      NaN
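
In the output above, "about 5 minutes" ends up as NaN because `re.match()` only looks at the start of the string. A minimal sketch of the `re.search()` alternative follows. The helper name `minutes_anywhere` and the sample strings are hypothetical, chosen to mirror values seen in the `length_of_time` column.

```python
import re

def minutes_anywhere(text):
    """Return the first integer directly followed by 'minute' or 'minutes', else None."""
    found = re.search(r"(\d+)\s*minutes?", str(text))
    return int(found.group(1)) if found else None

# Sample strings modelled on the length_of_time values shown earlier
samples = ["about 5 minutes", "10 minutes", "2 weeks", "30sec."]
results = [minutes_anywhere(s) for s in samples]
print(results)
```

The same function could be passed to `.apply()` on the text column in place of `return_minutes`; doing so would change the `minutes` column and every figure derived from it.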


## 5. Identifying features for standardization 
In this section, you'll investigate the variance of columns in the UFO dataset to 
determine which features should be standardized. You can log-normalize a high-variance column.

In [10]:
# Import the numpy package to use the log() for normalization
import numpy as np

# Printing the columns of the Dataset
print("=============================")
print("Columns of the copied dataset")
print("=============================")
print(ufo_copy.columns)
# Printing the variance of the columns of minutes and seconds
print("===============================")
print("Variance of seconds and minutes")
print("===============================")
# print(ufo_copy[["seconds","minutes"]].var())
print(ufo_copy[["seconds"]].var())
print(ufo_copy[["minutes"]].var())
# Performing the normalization on the seconds column using the log()
ufo_copy["seconds_after_log"] = np.log(ufo_copy["seconds"])
# print(ufo_copy["seconds_after_log"])
# Print out the variance of just the seconds_log column
print("=======================================")
print("Variance of seconds after normalization")
print("=======================================")
print(ufo_copy["seconds_after_log"].var())

Columns of the copied dataset
Index(['date', 'city', 'state', 'country', 'type', 'seconds', 'length_of_time',
       'desc', 'recorded', 'lat', 'long', 'extracted_minutes', 'minutes'],
      dtype='object')
Variance of seconds and minutes
seconds    1.767437e+10
dtype: float64
minutes    112.782792
dtype: float64
Variance of seconds after normalization
4.83635013345897
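
To see why the log transform shrinks the variance so sharply, the sketch below applies it to the five `seconds` values printed in section 3. One very large value (two weeks in seconds) dominates the raw variance, while the log compresses it. Note that `np.log(0)` is `-inf`, so the transform only behaves well on strictly positive values.

```python
import numpy as np
import pandas as pd

# The first five seconds values shown in the notebook output
seconds = pd.Series([1209600.0, 30.0, 300.0, 600.0, 600.0])

raw_variance = seconds.var()
log_variance = np.log(seconds).var()

print("Variance before log:", raw_variance)
print("Variance after log: ", log_variance)
```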


## 6. Encoding categorical variables
There are a couple of columns in the UFO dataset that need to be encoded before they can be 
modeled with scikit-learn. 
You'll do that transformation here, <b>using both binary and one-hot encoding methods</b>.

In [11]:
print("====================================")
print("Binary encoded of the country column")
print("====================================")
# LabelEncoder assigns a unique integer to each distinct value in a column. After dropping the rows with nulls we are
# left with three country values, which it would encode as [0,1,2]
# from sklearn.preprocessing import LabelEncoder
# country_enc = LabelEncoder()
# ufo_copy["country_enc"] = country_enc.fit_transform(ufo_copy["country"])
# print(ufo_copy["country_enc"].unique())
ufo_copy["country_code_enc"] = np.where(ufo_copy["country"].str.contains("us"), 1, 0)
print(ufo_copy["country_code_enc"].head())
#print(ufo_copy["country_code_enc"].head())
# Using the get_dummies() method from pandas package
# ufo_copy["type"] = pd.get_dummies(ufo_copy["type"])
ufo_copy_tenc=pd.get_dummies(ufo_copy["type"])
ufo_copy = pd.concat([ufo_copy, ufo_copy_tenc], axis=1)
print("==================================")
print("One-hot encoded of the type column")
print("==================================")
print(ufo_copy.head())

Binary encoded of the country column
0    1
1    1
3    1
5    1
6    1
Name: country_code_enc, dtype: int64
One-hot encoded of the type column
                 date       city state country      type    seconds  \
0 2011-11-03 19:21:00  woodville    wi      us   unknown  1209600.0   
1 2004-10-03 19:05:00  cleveland    oh      us    circle       30.0   
3 2002-11-21 05:45:00   clemmons    nc      us  triangle      300.0   
5 2012-06-16 23:00:00  san diego    ca      us     light      600.0   
6 2009-07-12 21:30:00     duluth    mn      us      oval      600.0   

              length_of_time  \
0                    2 weeks   
1                     30sec.   
3            about 5 minutes   
5                 10 minutes   
6  total? maybe around 10 mi   

                                                desc  recorded         lat  \
0  Red blinking objects similar to airplanes or s...  12/12/11  44.9530556   
1               Many fighter jets flying towards UFO  10/27/04  41.4994444   
3 
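
The two encodings used in this section can be seen more easily on a tiny frame. The sketch below is an illustration with invented rows: a 0/1 flag for the country column and one indicator column per category for the type column. It compares with `== "us"` rather than `str.contains("us")`, which is a slightly stricter test than the one used in the notebook.

```python
import numpy as np
import pandas as pd

# Invented rows standing in for the UFO data
demo = pd.DataFrame({
    "country": ["us", "ca", "us"],
    "type": ["light", "oval", "light"],
})

# Binary encoding: 1 for "us", 0 for everything else
demo["country_enc"] = np.where(demo["country"] == "us", 1, 0)

# One-hot encoding: one indicator column per distinct type
type_dummies = pd.get_dummies(demo["type"])
demo = pd.concat([demo, type_dummies], axis=1)

print(demo)
```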

## 7. Text vectorization (10 points)
Let's transform the <b>desc</b> column in the UFO dataset into tf/idf vectors, 
since there's likely something we can learn from this field.

In [12]:
# Check how many values are missing in the desc columns
print("========================================")
print("Count the null values in the desc column")
print("========================================")
print(ufo_copy[["desc"]].isnull().sum())

# Look at the desc column in the ufo Dataset
print("=========================")
print("Look into the desc column")
print("=========================")
print(ufo_copy["desc"].head())

from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TfidfVectorizer object. tfidf_v is short for tfidf_vector
tfidf_v = TfidfVectorizer()

# Transform the text into tf_idf vectors
transform_text_tfidf_v = tfidf_v.fit_transform(ufo_copy["desc"])

print("===================================")
print("Shape of the transform_text_tfidf_v")
print("===================================")
# Print the transform_text_tfidf_v shape
print(transform_text_tfidf_v.shape)
# print(tfidf_v.get_feature_names_out())

Count the null values in the desc column
desc    0
dtype: int64
Look into the desc column
0    Red blinking objects similar to airplanes or s...
1                 Many fighter jets flying towards UFO
3    It was a large&#44 triangular shaped flying ob...
5    Dancing lights that would fly around and then ...
6    A minor amber color trail&#44 (from where we w...
Name: desc, dtype: object
Shape of the transform_text_tfidf_v
(3723, 5209)
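
The shape above reads as (number of descriptions, size of the learned vocabulary). A small sketch on three invented sentences shows how that shape comes about: each row is a document and each column is one distinct token.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up descriptions, loosely modelled on the desc column
docs = [
    "red blinking lights",
    "many fighter jets",
    "dancing lights in the sky",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse matrix of tf-idf weights

print(matrix.shape)
print(sorted(vectorizer.vocabulary_))
```

With its default settings the vectorizer lowercases text and keeps tokens of two or more characters, so the ten distinct words here become ten columns.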


## 8. Selecting the ideal dataset 
Let's get rid of some of the unnecessary features. 

In [13]:
# Printing the date column in the ufo_copy Dataset
print("===================================")
print("Date column of the ufo_copy Dataset")
print("===================================")
print(ufo_copy["date"].head())
# Extract day from the date column
ufo_copy["day"] = ufo_copy["date"].dt.day
# Extract month from the date column
ufo_copy["month"] = ufo_copy["date"].dt.month
# Extract year from the date column
ufo_copy["year"] = ufo_copy["date"].dt.year
# Print the "date", "day", "month","year"
print("============================================================")
print("After extraction of the Day, Month and Year ufo_copy Dataset")
print("============================================================")
print(ufo_copy[["date", "day", "month","year"]].head())

Date column of the ufo_copy Dataset
0   2011-11-03 19:21:00
1   2004-10-03 19:05:00
3   2002-11-21 05:45:00
5   2012-06-16 23:00:00
6   2009-07-12 21:30:00
Name: date, dtype: datetime64[ns]
After extraction of the Day, Month and Year ufo_copy Dataset
                 date  day  month  year
0 2011-11-03 19:21:00    3     11  2011
1 2004-10-03 19:05:00    3     10  2004
3 2002-11-21 05:45:00   21     11  2002
5 2012-06-16 23:00:00   16      6  2012
6 2009-07-12 21:30:00   12      7  2009


In [14]:
# Printing the columns of the ufo_copy DataSet
print("====================================")
print("Columns list in the ufo_copy Dataset")
print("====================================")
print(ufo_copy.columns)

# Find the correlation between the columns
print("=============================================================================")
print("Finding the correlation between seconds, extracted_minutes, seconds_after_log")
print("=============================================================================")
# extracted_minutes holds text, so only the numeric columns take part in the correlation
print(ufo_copy[["seconds","seconds_after_log"]].corr())

# Remove the redundant columns
# Create a list of redundant column names to drop
to_drop = ["city", "country", "date", "desc","extracted_minutes", "lat","length_of_time","long","minutes", "recorded","seconds","state"]

# Drop those columns from the dataset
ufo_copy_subset = ufo_copy.drop(to_drop, axis=1)

# Printing the columns of the ufo_copy_subset DataSet
print("===========================================")
print("Columns list in the ufo_copy_subset Dataset")
print("===========================================")
print(ufo_copy_subset.columns)

Columns list in the ufo_copy Dataset
Index(['date', 'city', 'state', 'country', 'type', 'seconds', 'length_of_time',
       'desc', 'recorded', 'lat', 'long', 'extracted_minutes', 'minutes',
       'seconds_after_log', 'country_code_enc', 'changing', 'chevron', 'cigar',
       'circle', 'cone', 'cross', 'cylinder', 'diamond', 'disk', 'egg',
       'fireball', 'flash', 'formation', 'light', 'other', 'oval', 'rectangle',
       'sphere', 'teardrop', 'triangle', 'unknown', 'day', 'month', 'year'],
      dtype='object')
Finding the correlation between seconds, extracted_minutes, seconds_after_log
                    seconds  seconds_after_log
seconds            1.000000           0.177998
seconds_after_log  0.177998           1.000000
Columns list in the ufo_copy_subset Dataset
Index(['type', 'seconds_after_log', 'country_code_enc', 'changing', 'chevron',
       'cigar', 'circle', 'cone', 'cross', 'cylinder', 'diamond', 'disk',
       'egg', 'fireball', 'flash', 'formation', 'light', 'othe

## 9. Split the X and y using train_test_split, setting stratify = y

In [15]:
X = ufo_copy_subset.drop(["type"],axis = 1)
y = ufo_copy_subset["type"].astype(str)
from sklearn.model_selection import train_test_split 
train_X, test_X, train_y, test_y = train_test_split(X, y, stratify=y)
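
Setting `stratify=y` keeps the class proportions of `y` the same in the training and test sets. The sketch below checks this on an invented 60/40 label split; the names `demo_X` and `demo_y`, along with `test_size` and `random_state`, are choices made for this illustration and are not part of the notebook's split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 20 invented samples: 12 labelled "light" and 8 labelled "circle"
demo_X = pd.DataFrame({"feature": range(20)})
demo_y = pd.Series(["light"] * 12 + ["circle"] * 8)

tr_X, te_X, tr_y, te_y = train_test_split(
    demo_X, demo_y, test_size=0.25, stratify=demo_y, random_state=0
)

# Both parts keep the 60/40 ratio of the full label column
train_counts = tr_y.value_counts().to_dict()
test_counts = te_y.value_counts().to_dict()
print(train_counts)
print(test_counts)
```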


## 10. Fit knn to the training sets and print the score of knn on the test sets

In [16]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
# Fit knn to the training sets
knn.fit(train_X, train_y)
# Print the score of knn on the training sets
# (knn.score(test_X, test_y) gives the score on the test sets)
print("==========================")
print("Score of the training sets")
print("==========================")
print(knn.score(train_X, train_y))

Score of the training sets
0.5146848137535817
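
KNN classifies by distance, so a feature with a large numeric range (a year in the thousands, for instance) can outweigh 0/1 indicator columns. One common remedy is to standardize the features before fitting. The sketch below shows that idea on synthetic data generated with `make_classification`, and reports accuracy on both the training and the test split. It is an illustration under those assumptions, not a result for the UFO dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; one feature is inflated to mimic a column on a much larger scale
syn_X, syn_y = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)
syn_X[:, 0] *= 1000

tr_X, te_X, tr_y, te_y = train_test_split(
    syn_X, syn_y, stratify=syn_y, random_state=0
)

# The scaler is fitted on the training data only, then applied to both splits
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(tr_X, tr_y)

train_score = model.score(tr_X, tr_y)
test_score = model.score(te_X, te_y)
print("Training accuracy:", train_score)
print("Test accuracy:    ", test_score)
```

Putting the scaler inside the pipeline keeps test-set statistics out of the fitting step.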


In [17]:
from sklearn.model_selection import train_test_split  
from sklearn.naive_bayes import GaussianNB
# Split the dataset according to the class distribution of type
y = ufo_copy["type"].astype(str)
train_X, test_X, train_y, test_y = train_test_split(transform_text_tfidf_v.toarray(), y, stratify = y)
nb = GaussianNB()
# Fit the model to the training data
nb.fit(train_X, train_y)
print("======================")
print("Score of the test sets")
print("======================")
# Print out the model's accuracy
print(nb.score(test_X, test_y))

Score of the test sets
0.14500537056928034
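
`GaussianNB` needs a dense array, which is why the cell above calls `.toarray()` on the tf-idf matrix. `MultinomialNB` accepts the sparse matrix directly and is a common choice for text features. The sketch below shows that variant on four invented descriptions; whether it would score better on the UFO data is not something this notebook establishes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented descriptions and labels for illustration only
docs = [
    "bright light hovering in the sky",
    "flashing light moving fast",
    "triangle shaped craft with three corners",
    "dark triangle flying low",
]
labels = ["light", "light", "triangle", "triangle"]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(docs)  # stays sparse, no .toarray() needed

clf = MultinomialNB()
clf.fit(features, labels)

# New text must go through the same fitted vectorizer
prediction = clf.predict(vectorizer.transform(["a bright light"]))
print(prediction)
```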
