# "ML Preprocessing"
> "Basic machine learning preprocessing"

- toc:true
- branch: master
- badges: true
- comments: true
- author: DataCamp & 재언
- categories: [jupyter, ml, ml preprocess, machine learning]

In [None]:
import pandas as pd

In [None]:
volunteer = pd.read_csv('https://assets.datacamp.com/production/repositories/1816/datasets/668b96955d8b252aa8439c7602d516634e3f015e/volunteer_opportunities.csv')
volunteer.head()

Unnamed: 0,opportunity_id,content_id,vol_requests,event_time,title,hits,summary,is_priority,category_id,category_desc,amsl,amsl_unit,org_title,org_content_id,addresses_count,locality,region,postalcode,primary_loc,display_url,recurrence_type,hours,created_date,last_modified_date,start_date_date,end_date_date,status,Latitude,Longitude,Community Board,Community Council,Census Tract,BIN,BBL,NTA
0,4996,37004,50,0,Volunteers Needed For Rise Up & Stay Put! Home...,737,Building on successful events last summer and ...,,,,,,Center For NYC Neighborhoods,4426,1,,NY,,,/opportunities/4996,onetime,0,January 13 2011,June 23 2011,July 30 2011,July 30 2011,approved,,,,,,,,
1,5008,37036,2,0,Web designer,22,Build a website for an Afghan business,,1.0,Strengthening Communities,,,Bpeace,37026,1,"5 22nd St\nNew York, NY 10010\n(40.74053152272...",NY,10010.0,,/opportunities/5008,onetime,0,January 14 2011,January 25 2011,February 01 2011,February 01 2011,approved,,,,,,,,
2,5016,37143,20,0,Urban Adventures - Ice Skating at Lasker Rink,62,Please join us and the students from Mott Hall...,,1.0,Strengthening Communities,,,Street Project,3001,1,,NY,10026.0,,/opportunities/5016,onetime,0,January 19 2011,January 21 2011,January 29 2011,January 29 2011,approved,,,,,,,,
3,5022,37237,500,0,Fight global hunger and support women farmers ...,14,The Oxfam Action Corps is a group of dedicated...,,1.0,Strengthening Communities,,,Oxfam America,2170,1,,NY,2114.0,,/opportunities/5022,ongoing,0,January 21 2011,January 25 2011,February 14 2011,March 31 2012,approved,,,,,,,,
4,5055,37425,15,0,Stop 'N' Swap,31,Stop 'N' Swap reduces NYC's waste by finding n...,,4.0,Environment,,,Office of Recycling Outreach and Education,36773,1,,NY,10455.0,,/opportunities/5055,onetime,0,January 28 2011,February 01 2011,February 05 2011,February 05 2011,approved,,,,,,,,


## Missing data 

### Columns

In [None]:
print(len(volunteer.columns))
print(len(volunteer.dropna(axis=1, thresh=3).columns)) # keep only columns with at least 3 non-missing values

35
24
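The `thresh` argument is easy to misread, so here is a minimal sketch on a toy DataFrame (the data and the names `toy` and `kept` are made up for illustration). It shows that `thresh=3` keeps columns with at least three non-missing values rather than dropping columns with three or more missing ones.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a" is complete, "b" has 3 values, "c" has only 1.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# thresh=3 keeps every column that has at least 3 non-missing values.
kept = toy.dropna(axis=1, thresh=3)
print(list(kept.columns))
```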


### Row

In [None]:
print(volunteer.category_desc.isna().sum())
volunteer_subset = volunteer[volunteer.category_desc.notnull()] # keep only rows where category_desc is not missing
print(volunteer_subset.shape)

48
(617, 35)


### Working with data types

In [None]:
volunteer.dtypes

opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64
Census Tract          float64
BIN                   float64
BBL       

### Class distribution

* The default splitting parameters of `train_test_split` will work well in many scenarios.
* However, if your labels have an uneven distribution, your test and training sets might not be representative samples of your dataset and could bias the model you're trying to train.
* A good technique for sampling more accurately when you have imbalanced classes is **stratified sampling**, which is a way of sampling that takes into account the distribution of classes or features in your dataset.
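As a rough illustration of the point about imbalanced labels, the sketch below builds a hypothetical 90/10 label column (`labels` and `features` are invented names and values) and checks that a stratified split keeps the minority class in the test set in roughly the original proportion.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 rows of class "a", 10 rows of class "b".
labels = pd.Series(["a"] * 90 + ["b"] * 10)
features = pd.DataFrame({"x": range(100)})

# stratify=labels asks the splitter to preserve the class proportions.
_, _, y_tr, y_te = train_test_split(
    features, labels, stratify=labels, random_state=0
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

With the default `test_size` of 0.25, the test set holds 25 rows, and the stratified split places about a tenth of them in class "b".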

In [None]:
volunteer.category_desc.value_counts() # Environment and Emergency Preparedness each have fewer than 50 rows

Strengthening Communities    307
Helping Neighbors in Need    119
Education                     92
Health                        52
Environment                   32
Emergency Preparedness        15
Name: category_desc, dtype: int64

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
volunteer_X = volunteer.drop('category_desc', axis = 1)
volunteer_y = volunteer[['category_desc']]

X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify=volunteer_y) # default split: 75% train, 25% test
print(y_train['category_desc'].value_counts())

Strengthening Communities    230
Helping Neighbors in Need     89
Education                     69
Health                        39
Environment                   24
Emergency Preparedness        11
Name: category_desc, dtype: int64


## Standardizing data

* Data often contains numerical noise, such as features with high variance or features measured on different scales.
    * The preprocessing solution for that is **standardization**, which is a method used to transform continuous data to make it look normally distributed.
* Many scikit-learn models assume normally distributed data.
    * If the data isn't, you risk biasing your model.
* If you're working with a model that uses a linear distance metric or operates in a linear space, such as K-nearest neighbors, linear regression, or K-means clustering, the model assumes that the data and features you give it are related in a linear fashion, or can be measured with a linear distance metric.
    * A feature or features with high variance can violate this assumption.
    * This could bias a model that assumes the data is normally distributed.
* If a feature in your dataset has a variance that's an order of magnitude or more greater than the other features, this could impact the model's ability to learn from other features in the dataset.
* Modeling a dataset that contains continuous features on different scales is another scenario to watch out for.
    * e.g. consider a dataset that contains a column related to height and another related to weight. In order to compare these features, they must be in the same linear space, and therefore must be standardized in some way.
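To make the height and weight example concrete, here is a small sketch with invented measurements (the `people` array and the `euclidean` helper are hypothetical). It compares Euclidean distances before and after standardizing each column to mean 0 and variance 1.

```python
import numpy as np

# Hypothetical people: [height in metres, weight in kilograms].
people = np.array([
    [1.60, 60.0],
    [1.90, 61.0],   # very different height, similar weight
    [1.61, 75.0],   # similar height, very different weight
])

def euclidean(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances from person 0: weight dominates because its numbers are larger.
raw_01 = euclidean(people[0], people[1])
raw_02 = euclidean(people[0], people[2])

# Standardize each column, then measure again.
scaled = (people - people.mean(axis=0)) / people.std(axis=0)
std_01 = euclidean(scaled[0], scaled[1])
std_02 = euclidean(scaled[0], scaled[2])

print(raw_01, raw_02)
print(std_01, std_02)
```

On the raw data the 30 cm height gap barely registers next to the 15 kg weight gap. After standardization both gaps contribute on a comparable scale.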

In [None]:
wine = pd.read_csv('https://assets.datacamp.com/production/repositories/1816/datasets/9bd5350dfdb481e0f94eeef6acf2663452a8ef8b/wine_types.csv')
wine.head()

Unnamed: 0,Type,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [None]:
X = wine[['Proline', 'Total phenols', 'Hue', 'Nonflavanoid phenols']]
y = wine[['Type']]

In [None]:
from sklearn.neighbors import KNeighborsClassifier 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
knn = KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
                           weights='uniform')
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))

0.6444444444444445


### Log normalization
* A method for standardizing your data that can be useful when you have a particular column with high variance.
* In the example above, training a K-nearest neighbors classifier on that subset of the wine dataset didn't yield a high accuracy score.
    * This is because within that subset, the Proline column has extremely high variance, which affects the accuracy of the classifier.
* Log normalization applies a log transformation to your values, which transforms them onto a scale that approximates normality.
* It takes the natural log of each value, which is the exponent to which you would raise the mathematical constant *e* (approximately 2.718) to get that value.
    * e.g. the natural log of 30 is approximately 3.4, because *e* to the power of 3.4 is approximately 30.
* A good strategy when you care about relative changes in a linear model, when you want to capture the magnitude of change, and when you want to keep everything in the positive space.
* A nice way to minimize the variance of a column and make it comparable to other columns for modeling.
* Use NumPy's `np.log` to apply it.

In [None]:
import numpy as np

In [None]:
np.exp(3.4), np.log(30)

(29.96410004739701, 3.4011973816621555)

In [None]:
wine.var() ## Proline needs to be standardized

Type                                0.600679
Alcohol                             0.659062
Malic acid                          1.248015
Ash                                 0.075265
Alcalinity of ash                  11.152686
Magnesium                         203.989335
Total phenols                       0.391690
Flavanoids                          0.997719
Nonflavanoid phenols                0.015489
Proanthocyanins                     0.327595
Color intensity                     5.374449
Hue                                 0.052245
OD280/OD315 of diluted wines        0.504086
Proline                         99166.717355
dtype: float64

In [None]:
wine['Proline_log'] = np.log(wine['Proline'])
print(wine['Proline_log'].var())

0.17231366191842018


### Scaling data for feature comparison(Feature scaling)
* A method of standardization that's most useful when you're working with a dataset that contains continuous features on different scales and you're using a model that operates in some sort of linear space (like linear regression).
* Feature scaling transforms the features so they have a mean of 0 and a variance of 1.
    * This makes it easier to compare features linearly.
* An advantage of the `StandardScaler` object is that you can apply the same transformation to other data, such as a test set or new data from the same source, without having to rescale everything.

* Suppose we want to use the Ash, Alcalinity of ash, and Magnesium columns in the wine dataset to train a linear model. These columns may all be measured on different scales, which would bias a linear model.
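The point about reusing a fitted `StandardScaler` on other data can be sketched as follows. The arrays `train` and `test` are randomly generated stand-ins, not the wine data: the scaler learns its mean and standard deviation from the training rows only, then applies those same statistics to the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features: two columns on very different scales.
train = rng.normal(loc=[2.0, 100.0], scale=[0.3, 15.0], size=(100, 2))
test = rng.normal(loc=[2.0, 100.0], scale=[0.3, 15.0], size=(20, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean and std from train only
test_scaled = scaler.transform(test)        # reuse the same mean and std

print(train_scaled.mean(axis=0).round(3), train_scaled.std(axis=0).round(3))
print(test_scaled.mean(axis=0).round(3), test_scaled.std(axis=0).round(3))
```

The training columns come out with mean 0 and variance 1 exactly, while the test columns are only approximately centered. That is expected, because the test statistics were never used to fit the scaler.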

In [None]:
wine[['Ash', 'Alcalinity of ash', 'Magnesium']].describe()

Unnamed: 0,Ash,Alcalinity of ash,Magnesium
count,178.0,178.0,178.0
mean,2.366517,19.494944,99.741573
std,0.274344,3.339564,14.282484
min,1.36,10.6,70.0
25%,2.21,17.2,88.0
50%,2.36,19.5,98.0
75%,2.5575,21.5,107.0
max,3.23,30.0,162.0


In [None]:
from sklearn.preprocessing import StandardScaler

* In scikit-learn, running fit_transform during preprocessing will both fit the method to the data as well as transform the data in a single step.

In [None]:
ss = StandardScaler()
wine_subset = wine[['Ash', 'Alcalinity of ash', 'Magnesium']]
wine_subset_scaled = ss.fit_transform(wine_subset)



### Compare the difference in model performance between scaled and unscaled data
* K-nearest neighbors is a model that classifies data based on its distance to training set data.
* A new data point is assigned a label based on the class that the majority of surrounding data points belong to.

In [None]:
X = wine[['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium',
       'Total phenols', 'Flavanoids', 'Nonflavanoid phenols',
       'Proanthocyanins', 'Color intensity', 'Hue',
       'OD280/OD315 of diluted wines', 'Proline']]
y = wine[['Type']]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y)

knn.fit(X_train, y_train)

print(knn.score(X_test, y_test))

0.6222222222222222




In [None]:
ss = StandardScaler()

X_scaled = ss.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

knn.fit(X_train, y_train)

print(knn.score(X_test, y_test))

0.9777777777777777


  


## Feature engineering

* Feature engineering is the creation of new features based on existing features. It adds information to your dataset that is useful in some way: features useful for your prediction or clustering task, or insight into relationships between features.
* In real-world data, you'll likely have to extract and expand information that exists in the columns of your dataset.
* It is also very dependent on the particular dataset you're analyzing.
* e.g. timestamps can be broken into days or months, and headlines can be used for natural language processing.

In [None]:
volunteer[['title', 'created_date', 'category_desc']].head()

Unnamed: 0,title,created_date,category_desc
0,Volunteers Needed For Rise Up & Stay Put! Home...,January 13 2011,
1,Web designer,January 14 2011,Strengthening Communities
2,Urban Adventures - Ice Skating at Lasker Rink,January 19 2011,Strengthening Communities
3,Fight global hunger and support women farmers ...,January 21 2011,Strengthening Communities
4,Stop 'N' Swap,January 28 2011,Environment


### Encoding categorical variables
* Because models in scikit-learn require numerical input, if your dataset contains categorical variables, you'll have to encode them

#### Binary variable
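* Use the `apply` method with a lambda, e.g. `.apply(lambda val: 1 if val == "y" else 0)`.
* Or use scikit-learn's `LabelEncoder`.

#### One-hot encoding
* Encodes a categorical variable into 1s and 0s when it has more than two values to encode.
* Each value is transformed into an array with a 1 in that value's position and 0s elsewhere.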
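The two encoding approaches can be sketched on a tiny made-up frame (`trails` and its values are hypothetical, loosely modeled on a hiking table): `apply` with a lambda for a binary column, and `pd.get_dummies` for a column with more than two categories.

```python
import pandas as pd

# Hypothetical trail data.
trails = pd.DataFrame({
    "accessible": ["Y", "N", "N", "Y"],
    "difficulty": ["Easy", "Hard", "Moderate", "Easy"],
})

# Binary variable: map "Y" to 1 and everything else to 0.
trails["accessible_enc"] = trails["accessible"].apply(lambda v: 1 if v == "Y" else 0)

# More than two categories: one 0/1 column per category.
difficulty_enc = pd.get_dummies(trails["difficulty"], dtype=int)

print(trails[["accessible", "accessible_enc"]])
print(difficulty_enc)
```

Each row of `difficulty_enc` contains exactly one 1, in the column named after that row's category.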

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
hiking = pd.read_json('https://assets.datacamp.com/production/repositories/1816/datasets/4f26c48451bdbf73db8a58e226cd3d6b45cf7bb5/hiking.json')
hiking.head()

Unnamed: 0,Prop_ID,Name,Location,Park_Name,Length,Difficulty,Other_Details,Accessible,Limited_Access,lat,lon
0,B057,Salt Marsh Nature Trail,"Enter behind the Salt Marsh Nature Center, loc...",Marine Park,0.8 miles,,<p>The first half of this mile-long trail foll...,Y,N,,
1,B073,Lullwater,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,1.0 mile,Easy,Explore the Lullwater to see how nature thrive...,N,N,,
2,B073,Midwood,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.75 miles,Easy,Step back in time with a walk through Brooklyn...,N,N,,
3,B073,Peninsula,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.5 miles,Easy,Discover how the Peninsula has changed over th...,N,N,,
4,B073,Waterfall,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.5 miles,Easy,Trace the source of the Lake on the Waterfall ...,N,N,,


In [None]:
enc = LabelEncoder()

hiking['Accessible_enc'] = enc.fit_transform(hiking['Accessible'])

print(hiking[['Accessible', 'Accessible_enc']].head())

  Accessible  Accessible_enc
0          Y               1
1          N               0
2          N               0
3          N               0
4          N               0


In [None]:
category_enc = pd.get_dummies(volunteer['category_desc'])

print(category_enc.head())

   Education  Emergency Preparedness  Environment  Health  \
0          0                       0            0       0   
1          0                       0            0       0   
2          0                       0            0       0   
3          0                       0            0       0   
4          0                       0            1       0   

   Helping Neighbors in Need  Strengthening Communities  
0                          0                          0  
1                          0                          1  
2                          0                          1  
3                          0                          1  
4                          0                          0  


### Engineering numerical features
#### Aggregate statistics
* A common method of feature engineering is to take an aggregate of a set of numbers to use in place of those features.
    * This can be helpful in reducing the dimensionality of your feature space.
    * Perhaps you don't need multiple similar values that are close in distance to each other.
* Dates and timestamps are another area where you might want to reduce granularity in the dataset.
    * In a prediction task, you may only need high-level information like the month or the year, or both.
In [None]:
volunteer["start_date_converted"] = pd.to_datetime(volunteer.start_date_date)

volunteer["start_date_month"] = volunteer["start_date_converted"].apply(lambda row: row.month)

print(volunteer[['start_date_converted', 'start_date_month']].head())

  start_date_converted  start_date_month
0           2011-07-30                 7
1           2011-02-01                 2
2           2011-01-29                 1
3           2011-02-14                 2
4           2011-02-05                 2


#### Text classification
* One way to extract information from strings is with regular expressions.
    * A regular expression (regex) is a pattern that can be used to extract matching substrings from text data.
* **Vectorizing text**
    * tf/idf is a way of vectorizing text that reflects how important a word is in a document beyond how frequently it occurs.
    * It stands for term frequency (tf) inverse document frequency (idf) and places weight on words that are ultimately more significant in the entire corpus.
* Now that we have a vectorized version of the text, we can use it for classification.
![pic](./naivebayes.png)
    * Use a naive Bayes classifier, which is based on Bayes' theorem of conditional probability.
    * It performs well on text classification tasks.
    * Naive Bayes treats each feature as independent from the others, which can be a naive assumption, but this works out well on text data.
    * Because each feature is treated independently, this classifier works well on high-dimensional data and is very efficient.
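The vectorize-then-classify flow can be sketched end to end on a tiny invented corpus (the `titles` and `labels` lists are made up). Note that this sketch swaps in `MultinomialNB`, a naive Bayes variant that accepts sparse non-negative features directly. That is a choice made for this illustration, not necessarily the classifier used elsewhere in this post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical titles and categories.
titles = [
    "plant trees in the park",
    "clean up the park",
    "tutor students in math",
    "teach reading to students",
]
labels = ["Environment", "Environment", "Education", "Education"]

vec = TfidfVectorizer()
X = vec.fit_transform(titles)  # sparse matrix: one row per title, one column per word

clf = MultinomialNB()          # assumption: chosen here because it handles sparse input
clf.fit(X, labels)

# New text must go through the same fitted vectorizer.
pred = clf.predict(vec.transform(["clean the park"]))
print(pred[0])
```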

In [None]:
import re

def return_mileage(length):
    pattern = re.compile(r"\d+\.\d+")
    
    mile = re.match(pattern, length)
    
    # If a value is returned, use group(0) to return the found value
    if mile is not None:
        return float(mile.group())
        

hiking["Length_num"] = hiking["Length"].map(lambda row: return_mileage(str(row)))
print(hiking[["Length", "Length_num"]].head())

       Length  Length_num
0   0.8 miles        0.80
1    1.0 mile        1.00
2  0.75 miles        0.75
3   0.5 miles        0.50
4   0.5 miles        0.50
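One thing to watch with the pattern above: `re.match` only matches at the start of the string, and `\d+\.\d+` requires a decimal point, so a value such as "about 2 miles" would yield nothing. The following is a hedged variation using only the standard library (`extract_miles` is a hypothetical helper) that also handles those cases.

```python
import re

def extract_miles(text):
    """Return the first number in text as a float, or None if there is none."""
    # (?:\.\d+)? makes the decimal part optional, so "2 miles" also matches;
    # re.search scans the whole string, whereas re.match only checks the start.
    found = re.search(r"\d+(?:\.\d+)?", text)
    return float(found.group()) if found else None

print(extract_miles("0.75 miles"))
print(extract_miles("about 2 miles"))
print(extract_miles("nan"))
```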


In [None]:
def return_m(length):
    pattern = re.compile(r"\d+\.\d+")
    
    mile = re.match(pattern, length)
    
    # If a value is returned, use group(0) to return the found value
    if mile is not None:
        print(float(mile.group()))
        

In [None]:
pattern = re.compile(r"\d+\.\d+")
mile = re.match(pattern, hiking.Length[0])
mile.group()

'0.8'

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
volunteer = volunteer[['title', 'category_desc']].dropna(axis=0)

title_text = volunteer['title']

tfidf_vec = TfidfVectorizer()

text_tfidf = tfidf_vec.fit_transform(title_text)

In [None]:
from sklearn.naive_bayes import GaussianNB

y = volunteer['category_desc']

X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y)

nb = GaussianNB()
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))

0.5161290322580645


## Feature selection

* A method of selecting features from your feature set to be used for modeling.
* It draws from a set of existing features, so it differs from feature engineering in that it doesn't create new features.
* Sometimes it helps to get rid of noise in your model.
    * e.g. features that are strongly statistically correlated, which breaks the assumptions of certain models and thus impacts model performance.
* The overarching goal of feature selection is to improve the model's performance.
    * Perhaps your existing feature set is much too large, or some of the features you're working with are unnecessary.
* Scikit-learn has several methods for automated feature selection, such as choosing a variance threshold and using univariate statistical tests.

### Removing redundant features
* Remove noisy features
* Remove strongly correlated features
* Remove duplicated features

In [None]:
wine2 = wine[['Flavanoids', 'Total phenols', 'Malic acid',
              'OD280/OD315 of diluted wines', 'Hue']]

In [None]:
wine2.corr()

Unnamed: 0,Flavanoids,Total phenols,Malic acid,OD280/OD315 of diluted wines,Hue
Flavanoids,1.0,0.864564,-0.411007,0.787194,0.543479
Total phenols,0.864564,1.0,-0.335167,0.699949,0.433681
Malic acid,-0.411007,-0.335167,1.0,-0.36871,-0.561296
OD280/OD315 of diluted wines,0.787194,0.699949,-0.36871,1.0,0.565468
Hue,0.543479,0.433681,-0.561296,0.565468,1.0


In [None]:
# Take a minute to find the column where the correlation value is greater than 0.75 at least twice
to_drop = "Flavanoids"

# Drop that column from the DataFrame
wine2 = wine2.drop(to_drop, axis=1)
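Reading a correlation matrix by eye gets tedious as the number of columns grows. The sketch below is one possible way to automate it, on synthetic data (`df` and its columns are invented): it looks only at the upper triangle of the absolute correlation matrix and drops the later column of each highly correlated pair. This rule differs from the manual "correlated at least twice" criterion used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
# Hypothetical features: "b" is almost a copy of "a", "c" is independent.
df = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair of columns is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.75).any()]

reduced = df.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```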

#### Selecting features using text vectors
* After you've vectorized your text, the vocabulary and weights will be stored in the vectorizer.
* To pull out the vocabulary list, which you'll need in order to look at word weights, use the `vocabulary_` attribute.
* Each row of the vectorized data contains two components: the word weights and the indices of the words.
* Before putting together the vocabulary, the word indices, and their weights, we want to reverse the key-value pairs in the vocabulary so that indices map back to words.
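To see what those two row components look like, here is a minimal sketch on two made-up titles (`docs` is hypothetical). It reverses `vocabulary_` and pairs each row's `indices` with its `data` to recover a word-to-weight mapping.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["web designer needed", "volunteer web tutor"]  # hypothetical titles
vec = TfidfVectorizer()
mat = vec.fit_transform(docs)

# vocabulary_ maps word -> column index; reverse it to map index -> word.
index_to_word = {i: w for w, i in vec.vocabulary_.items()}

# A sparse row stores the column indices of its words and their weights.
row = mat[0]
weights = {index_to_word[i]: w for i, w in zip(row.indices, row.data)}
print(weights)
```

Because "web" appears in both titles, it receives a lower weight in the first row than "designer", which appears only once in the corpus.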

In [None]:
vocab = {v:k for k,v in tfidf_vec.vocabulary_.items()}

In [None]:
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    # Let's transform that zipped dict into a series
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Let's sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]

# Print out the weighted words
print(return_weights(vocab, tfidf_vec.vocabulary_, text_tfidf, 8, 3))

[189, 942, 466]


In [None]:
def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
    
        # Here we'll call the function from the previous exercise, and extend the list we're creating
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
    # Return the list in a set, so we don't get duplicate word indices
    return set(filter_list)

# Call the function to get the list of word indices
filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, 3)

# By converting filtered_words back to a list, we can use it to filter the columns in the text vector
filtered_text = text_tfidf[:, list(filtered_words)]

In [None]:
# Split the dataset according to the class distribution of category_desc, using the filtered_text vector
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify=y)

# Fit the model to the training data
nb.fit(train_X, train_y)

# Print out the model's accuracy
print(nb.score(test_X, test_y))

0.567741935483871


#### Dimensionality reduction
* A less manual way of reducing the size of the feature set.
* A form of unsupervised learning that transforms data in a way that shrinks the number of features in your feature space.
* This data transformation can be done in a linear or nonlinear fashion (combining or decomposing a feature space).
* A feature extraction method.
* PCA (Principal Component Analysis) uses a linear transformation to project features into a space where they are completely uncorrelated.
* While the feature space is reduced, the variance is captured in a meaningful way by combining features into components.
    * PCA captures, in each component, as much of the variance in the dataset as possible.
* A useful method when you have a large number of features and no strong candidates for elimination.

##### PCA caveats
* It can be very difficult to interpret PCA components beyond which components explain the most variance.
* It is more of a black-box method than other methods of dimensionality reduction.
* It is a good step to do at the end of the preprocessing journey, because of the way the data gets transformed and reshaped.

In [None]:
wine3 = wine[['Type', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash',
              'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols',
              'Proanthocyanins', 'Color intensity', 'Hue',
              'OD280/OD315 of diluted wines', 'Proline']]

In [None]:
from sklearn.decomposition import PCA

pca = PCA()
wine_X = wine3.drop("Type", axis=1)

transformed_X = pca.fit_transform(wine_X)

# Look at the percentage of variance explained by the different components
print(pca.explained_variance_ratio_)

[9.98091230e-01 1.73591562e-03 9.49589576e-05 5.02173562e-05
 1.23636847e-05 8.46213034e-06 2.80681456e-06 1.52308053e-06
 1.12783044e-06 7.21415811e-07 3.78060267e-07 2.12013755e-07
 8.25392788e-08]
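A first component that explains almost all of the variance often means that one feature is on a much larger scale than the rest, and the Proline variance shown earlier makes that a plausible explanation here. The sketch below uses synthetic data (`X` is randomly generated, not the wine set) to show how standardizing before PCA changes the explained-variance ratios.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: three independent features, one on a much larger scale.
X = np.column_stack([
    rng.normal(scale=1.0, size=300),
    rng.normal(scale=1.0, size=300),
    rng.normal(scale=300.0, size=300),
])

# PCA on raw data: the large-scale feature dominates the first component.
raw_ratio = PCA().fit(X).explained_variance_ratio_

# PCA on standardized data: the variance is spread across components.
scaled_ratio = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

print(raw_ratio.round(4))
print(scaled_ratio.round(4))
```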


In [None]:
y = wine3['Type']

X_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(transformed_X, y)

knn.fit(X_wine_train, y_wine_train)

print(knn.score(X_wine_test, y_wine_test))

0.6888888888888889


## Case Study - UFO sightings

### Change dtypes and drop missing values

In [None]:
ufo = pd.read_csv('https://assets.datacamp.com/production/repositories/1816/datasets/a5ebfe5d2ed194f2668867603b563963af4769e9/ufo_sightings_large.csv')
ufo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4935 entries, 0 to 4934
Data columns (total 11 columns):
date              4935 non-null object
city              4926 non-null object
state             4516 non-null object
country           4255 non-null object
type              4776 non-null object
seconds           4935 non-null float64
length_of_time    4792 non-null object
desc              4932 non-null object
recorded          4935 non-null object
lat               4935 non-null object
long              4935 non-null float64
dtypes: float64(2), object(9)
memory usage: 424.2+ KB


In [None]:
ufo.dtypes

date               object
city               object
state              object
country            object
type               object
seconds           float64
length_of_time     object
desc               object
recorded           object
lat                object
long              float64
dtype: object

In [None]:
# Change the date column to type datetime
ufo["date"] = pd.to_datetime(ufo["date"])

# Check the column types
print(ufo["date"].dtypes)

datetime64[ns]


In [None]:
ufo.isna().sum()

date                0
city                9
state             419
country           680
type              159
seconds             0
length_of_time    143
desc                3
recorded            0
lat                 0
long                0
dtype: int64

In [None]:
# Keep only rows where length_of_time, state, and type are not null
ufo_no_missing = ufo[ufo["length_of_time"].notnull() & 
                     ufo["state"].notnull() & 
                     ufo["type"].notnull()].copy()  # copy so new columns can be added without a SettingWithCopyWarning

# Print out the shape of the new dataset
print(ufo.shape)
print(ufo_no_missing.shape)

(4935, 11)
(4283, 11)
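The same row filter can also be written with `dropna(subset=...)`. Here is a small sketch on invented rows (`sightings` is hypothetical): it drops a row only when one of the listed columns is missing, and takes a copy so that later column assignments work on an independent frame.

```python
import pandas as pd

# Hypothetical sightings with gaps in different columns.
sightings = pd.DataFrame({
    "state": ["wi", None, "nc", "ab"],
    "type": ["unknown", "circle", None, "oval"],
    "country": ["us", "us", "us", None],
})

# Drop rows missing "state" or "type"; a gap in "country" is tolerated.
complete = sightings.dropna(subset=["state", "type"]).copy()
print(complete)
```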


In [None]:
ufo_no_missing.head(10)

Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,long
0,2011-11-03 19:21:00,woodville,wi,us,unknown,1209600.0,2 weeks,Red blinking objects similar to airplanes or s...,12/12/2011,44.9530556,-92.291111
1,2004-10-03 19:05:00,cleveland,oh,us,circle,30.0,30sec.,Many fighter jets flying towards UFO,10/27/2004,41.4994444,-81.695556
3,2002-11-21 05:45:00,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.0213889,-80.382222
4,2010-08-19 12:55:00,calgary (canada),ab,ca,oval,0.0,2,A white spinning disc in the shape of an oval.,8/24/2010,51.083333,-114.083333
5,2012-06-16 23:00:00,san diego,ca,us,light,600.0,10 minutes,Dancing lights that would fly around and then ...,7/4/2012,32.7152778,-117.156389
6,2009-07-12 21:30:00,duluth,mn,us,oval,600.0,total? maybe around 10 mi,A minor amber color trail&#44 (from where we w...,3/13/2012,46.7833333,-92.106389
7,2008-10-20 18:30:00,fairfield,tx,us,other,0.0,several sightings from 10,Multiple sightings in Central Texas (Freestone...,1/10/2009,31.7244444,-96.165
8,2013-06-09 00:00:00,oakville (canada),on,ca,light,120.0,2 minutes,Brilliant orange light or chinese lantern at o...,7/3/2013,43.433333,-79.666667
9,2013-04-26 23:27:00,lacey,wa,us,light,120.0,2 minutes,Bright red light moving north to north west fr...,5/15/2013,47.0344444,-122.821944
10,2013-09-13 20:30:00,ben avon,pa,us,sphere,300.0,5 minutes,North-east moving south-west. First 7 or so li...,9/30/2013,40.5080556,-80.083333


### Categorical variables and standardization

In [None]:
def search_minutes(time):
    
    pattern = re.compile(r"\d+ [min]+")

    m = re.search(pattern, time)
    
    if m is not None:
        return m.group()

ufo_no_missing['time_minutes'] = ufo_no_missing['length_of_time'].map(lambda x: search_minutes(str(x)))
ufo_no_missing.head()



Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,long,time_minutes
0,2011-11-03 19:21:00,woodville,wi,us,unknown,1209600.0,2 weeks,Red blinking objects similar to airplanes or s...,12/12/2011,44.9530556,-92.291111,
1,2004-10-03 19:05:00,cleveland,oh,us,circle,30.0,30sec.,Many fighter jets flying towards UFO,10/27/2004,41.4994444,-81.695556,
3,2002-11-21 05:45:00,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.0213889,-80.382222,5 min
4,2010-08-19 12:55:00,calgary (canada),ab,ca,oval,0.0,2,A white spinning disc in the shape of an oval.,8/24/2010,51.083333,-114.083333,
5,2012-06-16 23:00:00,san diego,ca,us,light,600.0,10 minutes,Dancing lights that would fly around and then ...,7/4/2012,32.7152778,-117.156389,10 min


In [None]:
ufo_df = ufo_no_missing.dropna(axis=0).copy()  # copy so new columns can be added safely
ufo_df.head()

Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,long,time_minutes
3,2002-11-21 05:45:00,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.0213889,-80.382222,5 min
5,2012-06-16 23:00:00,san diego,ca,us,light,600.0,10 minutes,Dancing lights that would fly around and then ...,7/4/2012,32.7152778,-117.156389,10 min
6,2009-07-12 21:30:00,duluth,mn,us,oval,600.0,total? maybe around 10 mi,A minor amber color trail&#44 (from where we w...,3/13/2012,46.7833333,-92.106389,10 mi
8,2013-06-09 00:00:00,oakville (canada),on,ca,light,120.0,2 minutes,Brilliant orange light or chinese lantern at o...,7/3/2013,43.433333,-79.666667,2 min
9,2013-04-26 23:27:00,lacey,wa,us,light,120.0,2 minutes,Bright red light moving north to north west fr...,5/15/2013,47.0344444,-122.821944,2 min


In [None]:
def return_minutes(time_string):
    
    pattern = re.compile(r"\d+")
    
    num = re.match(pattern, time_string)
    
    if num is not None:
        return int(num.group())
        
# Apply the extraction to the time_minutes column
ufo_df["minutes"] = ufo_df["time_minutes"].map(lambda x: return_minutes(str(x))).astype('float')

# Take a look at the head of both of the columns
print(ufo_df[['time_minutes', 'minutes']].head())

  time_minutes  minutes
3        5 min      5.0
5       10 min     10.0
6        10 mi     10.0
8        2 min      2.0
9        2 min      2.0
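
The helper above relies on `re.match`, which only matches at the start of the string. That works for `time_minutes` because its values begin with digits, but it would return `None` for a raw `length_of_time` value such as "about 5 minutes". A minimal sketch of the difference, assuming a hypothetical helper `first_number` (not part of the notebook) that uses `re.search` instead:

```python
import re

def first_number(text):
    """Hypothetical variant: first run of digits anywhere in text, else None."""
    found = re.search(r"\d+", text)
    return int(found.group()) if found else None

# re.match fails here because the string does not start with a digit
anchored = re.match(r"\d+", "about 5 minutes")

# re.search scans the whole string
anywhere = first_number("about 5 minutes")
leading = first_number("10 mi")
missing = first_number("nan")  # str(float("nan")) contains no digits

print(anchored, anywhere, leading, missing)
```

Whether searching the whole string is appropriate depends on the column: a value like "total? maybe around 10 mi" contains only one number, but free text with several numbers would need a stricter pattern.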




In [None]:
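import numpy as np

# Check the variance of the seconds and minutes columns
print(ufo_df[['seconds', 'minutes']].var())

# Log-normalize the seconds column; log1p(x) = log(1 + x) is defined at zero
ufo_df['seconds_log'] = np.log1p(ufo_df['seconds'])

# Print the variance of just the seconds_log column
print(ufo_df['seconds_log'].var())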

seconds    3.224621e+09
minutes    1.196762e+02
dtype: float64
1.1325251102467289
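
The drop in variance comes from the logarithm compressing large values far more than small ones. A small sketch on hand-picked durations (illustrative values echoing a few rows shown earlier, not the full dataset):

```python
import numpy as np
import pandas as pd

# Durations in seconds, including one extreme value (2 weeks)
durations = pd.Series([0.0, 30.0, 120.0, 300.0, 600.0, 1209600.0])

raw_var = durations.var()
log_var = np.log1p(durations).var()

# The single outlier dominates the raw variance but not the log-scaled one
print(raw_var, log_var)
```

`np.log1p` is used rather than `np.log` so that a duration of zero maps to zero instead of negative infinity.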




### Engineering new features

In [None]:
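# Encode country as 1 for "us" and 0 otherwise
ufo_df["country_enc"] = ufo_df["country"].apply(lambda val: 1 if val == "us" else 0)

# Print the number of unique type values
print(len(ufo_df["type"].unique()))

# One-hot encode the type column
type_set = pd.get_dummies(ufo_df["type"])

# Append the encoded columns to ufo_df
ufo_df = pd.concat([ufo_df, type_set], axis=1)

ufo_df.head()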

21




Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,...,flash,formation,light,other,oval,rectangle,sphere,teardrop,triangle,unknown
3,2002-11-21 05:45:00,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.0213889,...,0,0,0,0,0,0,0,0,1,0
5,2012-06-16 23:00:00,san diego,ca,us,light,600.0,10 minutes,Dancing lights that would fly around and then ...,7/4/2012,32.7152778,...,0,0,1,0,0,0,0,0,0,0
6,2009-07-12 21:30:00,duluth,mn,us,oval,600.0,total? maybe around 10 mi,A minor amber color trail&#44 (from where we w...,3/13/2012,46.7833333,...,0,0,0,0,1,0,0,0,0,0
8,2013-06-09 00:00:00,oakville (canada),on,ca,light,120.0,2 minutes,Brilliant orange light or chinese lantern at o...,7/3/2013,43.433333,...,0,0,1,0,0,0,0,0,0,0
9,2013-04-26 23:27:00,lacey,wa,us,light,120.0,2 minutes,Bright red light moving north to north west fr...,5/15/2013,47.0344444,...,0,0,1,0,0,0,0,0,0,0
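
The two encodings used above can be seen more easily on a tiny frame. This is a sketch on made-up rows, assuming the same column names; `dtype=int` is passed so the dummy columns print as 0/1 regardless of pandas version:

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["us", "ca", "us"],
    "type": ["light", "oval", "light"],
})

# Binary encoding: a boolean comparison cast to int
toy["country_enc"] = (toy["country"] == "us").astype(int)

# One-hot encoding: one column per distinct value of type
type_set = pd.get_dummies(toy["type"], dtype=int)

# Column-wise concatenation keeps the original columns alongside the dummies
toy = pd.concat([toy, type_set], axis=1)
print(toy)
```

The original `type` column is still present after the concatenation, which is why it can later serve as the target `y`.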


In [None]:
ufo_df.shape

(2147, 36)

In [None]:
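# Extract the month from the date column
ufo_df["month"] = ufo_df["date"].apply(lambda x: x.month)

# Extract the year from the date column
ufo_df["year"] = ufo_df["date"].apply(lambda x: x.year)

# Take a look at the head of all three columns
print(ufo_df[['date', 'month', 'year']].head())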

                 date  month  year
3 2002-11-21 05:45:00     11  2002
5 2012-06-16 23:00:00      6  2012
6 2009-07-12 21:30:00      7  2009
8 2013-06-09 00:00:00      6  2013
9 2013-04-26 23:27:00      4  2013
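
An equivalent, vectorized way to pull out date parts is the `.dt` accessor, which is available when the column has a datetime dtype. A short sketch on two timestamps copied from the output above:

```python
import pandas as pd

# Parse strings into a datetime64 Series
dates = pd.to_datetime(pd.Series(["2002-11-21 05:45:00", "2012-06-16 23:00:00"]))

# .dt exposes datetime components without a Python-level lambda
months = dates.dt.month
years = dates.dt.year

print(pd.DataFrame({"date": dates, "month": months, "year": years}))
```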


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Take a look at the head of the desc field
print(ufo_df['desc'].head())

# Create the tfidf vectorizer object
vec = TfidfVectorizer()

# Use vec's fit_transform method on the desc field
desc_tfidf = vec.fit_transform(ufo_df['desc'])

# Look at the number of columns this creates
print(desc_tfidf.shape)

3    It was a large&#44 triangular shaped flying ob...
5    Dancing lights that would fly around and then ...
6    A minor amber color trail&#44 (from where we w...
8    Brilliant orange light or chinese lantern at o...
9    Bright red light moving north to north west fr...
Name: desc, dtype: object
(2147, 3730)
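
The shape `(2147, 3730)` means one row per description and one column per distinct token. A minimal sketch on a three-document toy corpus (made-up sentences, default `TfidfVectorizer` settings) shows how the column count follows from the vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "bright red light",
    "bright orange light moving north",
    "large triangular object",
]

toy_vec = TfidfVectorizer()
toy_tfidf = toy_vec.fit_transform(docs)

# Rows = documents, columns = distinct tokens across the corpus
print(toy_tfidf.shape)

# vocabulary_ maps each token to its column index (assigned alphabetically)
print(sorted(toy_vec.vocabulary_.items(), key=lambda item: item[1]))
```

With the default tokenizer, tokens shorter than two characters are dropped, so the column count can be smaller than a naive word count.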


In [None]:
# Invert vocabulary_ (word -> column index) into column index -> word
vocab = {v:k for k,v in vec.vocabulary_.items()}
vocab

{1824: 'it',
 3572: 'was',
 1908: 'large',
 150: '44',
 3417: 'triangular',
 2910: 'shaped',
 1449: 'flying',
 2336: 'object',
 1001: 'dancing',
 1964: 'lights',
 3284: 'that',
 3683: 'would',
 1448: 'fly',
 441: 'around',
 372: 'and',
 3290: 'then',
 2113: 'merge',
 1805: 'into',
 2378: 'one',
 1957: 'light',
 2154: 'minor',
 362: 'amber',
 861: 'color',
 3371: 'trail',
 1496: 'from',
 3616: 'where',
 3585: 'we',
 3598: 'were',
 1334: 'extremely',
 3011: 'slow',
 2214: 'movement',
 368: 'an',
 2359: 'odd',
 863: 'coloration',
 1968: 'like',
 3286: 'the',
 2643: 'quot',
 2511: 'phoenix',
 676: 'brilliant',
 2393: 'orange',
 2389: 'or',
 798: 'chinese',
 1902: 'lantern',
 461: 'at',
 1940: 'less',
 3283: 'than',
 15: '1000',
 1499: 'ft',
 2219: 'moving',
 1214: 'east',
 3339: 'to',
 3599: 'west',
 296: 'across',
 2330: 'oakville',
 2380: 'ontario',
 2135: 'midnight',
 1851: 'june',
 275: '9th',
 97: '2013',
 667: 'bright',
 2698: 'red',
 2296: 'north',
 1692: 'horizon',
 3325: 'till',
 

In [None]:
# Check the correlation between the seconds, seconds_log, and minutes columns
print(ufo_df[["seconds", "seconds_log", "minutes"]].corr())

# Make a list of features to drop   
to_drop = ["city", "country", "country_enc", "date", "desc", "lat", "length_of_time",
           "long", "minutes", "recorded", "seconds", "state", "time_minutes"]

# Drop those features
ufo_df_dropped = ufo_df.drop(to_drop, axis=1)

# Let's also filter some words out of the text vector we created
filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, 4)

              seconds  seconds_log   minutes
seconds      1.000000     0.191504 -0.005390
seconds_log  0.191504     1.000000  0.817861
minutes     -0.005390     0.817861  1.000000
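
`words_to_filter` is called above but not defined in this section. The sketch below is one plausible implementation, an assumption rather than the notebook's actual helper: for each document it keeps the column indices of the `top_n` highest-weighted TF-IDF terms and returns their union as a set. The toy corpus and every name other than `words_to_filter` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def words_to_filter(vocab, original_vocab, vector, top_n):
    """Hypothetical stand-in: column indices of the top_n words per document."""
    vector = vector.tocsr()
    keep = set()
    for row in range(vector.shape[0]):
        # Slice out the non-zero entries of this row from the CSR arrays
        start, end = vector.indptr[row], vector.indptr[row + 1]
        cols = vector.indices[start:end]
        weights = vector.data[start:end]
        # Column indices of the top_n largest weights in this row
        top = cols[np.argsort(weights)[::-1][:top_n]]
        # index -> word -> index, mirroring the two vocabulary arguments
        keep.update(original_vocab[vocab[int(i)]] for i in top)
    return keep

docs = [
    "bright red light moving north",
    "bright orange light over the horizon",
    "large triangular shaped flying object",
]
toy_vec = TfidfVectorizer()
toy_tfidf = toy_vec.fit_transform(docs)
index_to_word = {v: k for k, v in toy_vec.vocabulary_.items()}

kept = words_to_filter(index_to_word, toy_vec.vocabulary_, toy_tfidf, 2)
print(sorted(index_to_word[i] for i in kept))
```

Because the result is a set of column indices, it can be turned into a list and used to slice the TF-IDF matrix column-wise.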


In [None]:
ufo_df_dropped.columns

Index(['type', 'seconds_log', 'changing', 'chevron', 'cigar', 'circle', 'cone',
       'cross', 'cylinder', 'diamond', 'disk', 'egg', 'fireball', 'flash',
       'formation', 'light', 'other', 'oval', 'rectangle', 'sphere',
       'teardrop', 'triangle', 'unknown', 'month', 'year'],
      dtype='object')

In [None]:
X = ufo_df_dropped.drop('type', axis=1)
y = ufo_df_dropped['type']

In [None]:
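from sklearn.model_selection import train_test_split

# Split the X and y sets, stratifying on y to preserve class proportions
train_X, test_X, train_y, test_y = train_test_split(X, y, stratify=y)

# Fit knn to the training sets
knn.fit(train_X, train_y)

# Print the accuracy score of knn on the test sets
print(knn.score(test_X, test_y))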

0.6312849162011173
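
The `knn` object is not created in this section. A self-contained sketch of the same split, fit and score pattern, assuming a `KNeighborsClassifier` with `n_neighbors=5` (an assumption, not confirmed by the notebook) and synthetic two-class data; `random_state` is fixed here only to make the sketch repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated synthetic clusters standing in for real features
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y_toy = np.array(["light"] * 50 + ["oval"] * 50)

# stratify keeps the class balance the same in train and test
train_X, test_X, train_y, test_y = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=5)  # assumed configuration
knn.fit(train_X, train_y)

accuracy = knn.score(test_X, test_y)
print(accuracy)
```

Without a fixed `random_state`, as in the cell above, the reported score changes from run to run.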


In [None]:
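from sklearn.naive_bayes import GaussianNB

# Use the set of filtered column indices to keep only those columns of the text vector
filtered_text = desc_tfidf[:, list(filtered_words)]

# Split the dense text features and y, stratifying on y
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify=y)

# Fit nb to the training sets
nb = GaussianNB(priors=None)
nb.fit(train_X, train_y)

# Print the accuracy score of nb on the test sets
print(nb.score(test_X, test_y))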

0.15828677839851024
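
The `.toarray()` call matters because `GaussianNB` expects a dense array, while `TfidfVectorizer` returns a sparse matrix. A minimal sketch of that conversion on a made-up four-document corpus with illustrative labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

docs = [
    "bright red light in the sky",
    "orange light moving north",
    "large triangular object hovering",
    "black triangular craft overhead",
]
labels = ["light", "light", "triangle", "triangle"]

sparse_features = TfidfVectorizer().fit_transform(docs)

# Convert to a dense NumPy array before fitting GaussianNB
dense_features = sparse_features.toarray()

toy_nb = GaussianNB()
toy_nb.fit(dense_features, labels)

pred = toy_nb.predict(dense_features)
print(pred)
```

Densifying is only practical when the number of columns is modest, which is one reason to filter the vocabulary down first.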
