# Introduction #
This notebook develops a data classifier for CalPolyDnD's capstone project. The classifier must determine what kind of data each column of a dataset contains.

In [1]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy

import joblib as jl

### Data Transformation and Cleaning ###

The first step in solving this problem is reshaping the dataset so that we can draw conclusions about each column rather than each row, which is the usual unit in machine learning problems.

In [2]:
df = pd.read_csv("data/training_data.csv")
df.head()

Unnamed: 0,first_name,last_name,email,gender,ip_address,phone_number,home_address,ssn
0,Jill,Mish,jmish0@sbwire.com,Female,107.233.251.179,364-613-4322,593 Dryden Park,875-50-3238
1,Frasquito,Hamer,fhamer1@washingtonpost.com,Male,150.174.167.55,613-810-5889,842 Westport Trail,109-37-5461
2,Maddie,Muggleston,mmuggleston2@amazon.co.jp,Female,140.184.222.179,158-740-4562,8 Ridgeview Terrace,618-05-4668
3,Selia,Wiffield,swiffield3@edublogs.org,Female,75.169.255.200,472-249-0728,4890 Acker Lane,677-86-3312
4,Raleigh,Ianetti,rianetti4@weebly.com,Male,222.183.109.137,657-609-7297,12827 Barnett Plaza,235-54-2795


To do this, we transpose the data and add a 'category' column to the transposed dataframe that records which kind of data each row holds.

In [3]:
df = df.transpose()
df['category'] = df.index
df.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1991,1992,1993,1994,1995,1996,1997,1998,1999,category
first_name,Jill,Frasquito,Maddie,Selia,Raleigh,Kipp,Flossi,Kaia,Carolann,Cam,...,Gaston,Gennie,Keen,Myranda,Regina,Wolf,Joanna,Bianca,Jabez,first_name
last_name,Mish,Hamer,Muggleston,Wiffield,Ianetti,Abercromby,Volkers,Flisher,Holttom,Adacot,...,Fawdry,Guichard,McTrustram,Marchent,Aberkirdo,Bithell,Laurie,Shadrack,Pretious,last_name
email,jmish0@sbwire.com,fhamer1@washingtonpost.com,mmuggleston2@amazon.co.jp,swiffield3@edublogs.org,rianetti4@weebly.com,kabercromby5@squidoo.com,fvolkers6@jugem.jp,kflisher7@list-manage.com,cholttom8@msn.com,cadacot9@dyndns.org,...,gfawdryrj@kickstarter.com,gguichardrk@oakley.com,kmctrustramrl@smugmug.com,mmarchentrm@twitpic.com,raberkirdorn@loc.gov,wbithellro@mediafire.com,jlaurierp@1und1.de,bshadrackrq@issuu.com,jpretiousrr@scribd.com,email
gender,Female,Male,Female,Female,Male,Female,Female,Female,Female,Male,...,Male,Female,Male,Female,Female,Male,Female,Female,Male,gender
ip_address,107.233.251.179,150.174.167.55,140.184.222.179,75.169.255.200,222.183.109.137,234.67.16.236,102.246.26.217,247.72.238.49,213.218.116.96,132.14.208.121,...,159.240.21.93,212.139.132.144,240.52.253.96,213.199.184.129,53.114.178.100,165.156.92.220,227.166.153.18,83.220.101.63,36.224.203.141,ip_address
phone_number,364-613-4322,613-810-5889,158-740-4562,472-249-0728,657-609-7297,552-683-0589,843-241-0936,725-953-9021,937-620-9140,622-126-0293,...,244-995-8457,691-992-1934,876-708-1212,926-325-4287,780-657-3624,252-569-9522,654-395-7627,785-969-9506,938-898-2068,phone_number
home_address,593 Dryden Park,842 Westport Trail,8 Ridgeview Terrace,4890 Acker Lane,12827 Barnett Plaza,16 Bellgrove Pass,10 Grim Hill,5928 Welch Lane,27 Atwood Parkway,42914 Huxley Drive,...,6 Corry Lane,3 Coolidge Point,98011 Buhler Trail,30 Independence Court,500 Roxbury Park,8 Calypso Alley,810 1st Alley,293 5th Way,8 Waubesa Court,home_address
ssn,875-50-3238,109-37-5461,618-05-4668,677-86-3312,235-54-2795,292-70-7424,508-52-0991,331-74-2147,211-70-1621,104-81-8368,...,364-37-6587,680-21-0177,399-68-5810,502-84-6796,686-30-5830,424-47-7300,236-63-1431,464-91-8544,580-15-5838,ssn


After this, we need to melt the dataframe so that it has a long shape instead of a wide one. Using the 'category' column that we added earlier as the id variable for the melt gives a new dataframe with the schema ('category', 'variable', 'value'). The 'variable' column holds only the original row index, which is of no use to us, so we drop it. That leaves a dataframe with the schema ('category', 'value').
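
As a small illustration of this reshape, the sketch below runs the transpose-then-melt route on a two-row toy frame and compares it with a direct `melt` that names the label field with `var_name`. The toy frame and the names `wide`, `long_a`, and `long_b` are assumptions made for this example; the notebook itself only uses the first route.

```python
import pandas as pd

# Toy wide frame; the values come from the first rows of the training data.
wide = pd.DataFrame({
    'first_name': ['Jill', 'Frasquito'],
    'email': ['jmish0@sbwire.com', 'fhamer1@washingtonpost.com'],
})

# Route used in this notebook: transpose, tag each row, then melt.
t = wide.transpose()
t['category'] = t.index
long_a = t.melt(id_vars=['category']).drop('variable', axis=1)

# Direct route: melt the wide frame and call the column-label field 'category'.
long_b = wide.melt(var_name='category')

# Row order differs between the two routes, so compare sorted pairs.
pairs_a = sorted(zip(long_a['category'], long_a['value']))
pairs_b = sorted(zip(long_b['category'], long_b['value']))
print(pairs_a == pairs_b)
```

Both routes end with the ('category', 'value') schema; only the row order differs.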

In [4]:
df = df.melt(id_vars=['category'])
df = df.drop('variable', axis=1)
df.head(10)

Unnamed: 0,category,value
0,first_name,Jill
1,last_name,Mish
2,email,jmish0@sbwire.com
3,gender,Female
4,ip_address,107.233.251.179
5,phone_number,364-613-4322
6,home_address,593 Dryden Park
7,ssn,875-50-3238
8,first_name,Frasquito
9,last_name,Hamer


Using this new dataframe, we have exactly what we need to train our model, but we still need to check for missing/NaN values in the dataset.

<h2><center>ADD SOME CODE FOR DATA CLEANING HERE </center></h2>
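
The cleaning step is still a placeholder. A minimal sketch of what it might look like is shown here, assuming that rows with a missing or blank 'value' should simply be dropped. The helper name `clean_long_frame` and the toy frame are hypothetical and not part of the project code.

```python
import numpy as np
import pandas as pd

def clean_long_frame(long_df):
    """Drop missing and blank values from a ('category', 'value') frame."""
    # Drop real NaN/None first, so a later astype(str) cannot turn them
    # into the literal string 'nan'.
    cleaned = long_df.dropna(subset=['value']).copy()
    # Normalise to stripped strings and discard anything left empty.
    cleaned['value'] = cleaned['value'].astype(str).str.strip()
    cleaned = cleaned[cleaned['value'] != '']
    return cleaned.reset_index(drop=True)

# Made-up rows: one NaN and one whitespace-only value.
toy = pd.DataFrame({
    'category': ['email', 'ssn', 'gender', 'phone_number'],
    'value': ['jmish0@sbwire.com', np.nan, '   ', '364-613-4322'],
})
cleaned = clean_long_frame(toy)
print(cleaned)
```

Dropping rows is only one option; whether missing values should instead be kept as their own signal is a design decision this sketch does not settle.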

# Building the Model #

After cleaning and reshaping our training dataset, we can start building our model. Let's take a quick look at our data.

In [5]:
df.head(10)

Unnamed: 0,category,value
0,first_name,Jill
1,last_name,Mish
2,email,jmish0@sbwire.com
3,gender,Female
4,ip_address,107.233.251.179
5,phone_number,364-613-4322
6,home_address,593 Dryden Park
7,ssn,875-50-3238
8,first_name,Frasquito
9,last_name,Hamer


Our dataset contains nothing but strings, and the model cannot be trained on strings directly. We therefore need to convert each string into a numerical representation that the model can do math with.

Enter scikit-learn's HashingVectorizer. This vectorizer hashes the tokens of each string we pass into it and gives us a sparse matrix of floats that describes each string. By default, the HashingVectorizer maps tokens into $2^{20}$ features. According to the documentation, this is usually a good number of features for text classification.
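
To make the shape of that matrix concrete, here is a short sketch that vectorizes two values from the training data. The second vectorizer, which hashes character n-grams through the `analyzer` and `ngram_range` parameters, is a hypothetical alternative shown only for comparison; the notebook uses the defaults, and whether character n-grams would help this classifier is untested here.

```python
from sklearn.feature_extraction.text import HashingVectorizer

samples = ['jmish0@sbwire.com', '364-613-4322']

# Default settings: word tokens hashed into 2**20 columns.
# The vectorizer is stateless, so transform() works without fitting.
word_vect = HashingVectorizer()
X_word = word_vect.transform(samples)
print(X_word.shape)  # (number of strings, number of hashed features)
print(X_word.nnz)    # stored non-zero entries across both rows

# Hypothetical alternative (not used in this notebook): character n-grams.
char_vect = HashingVectorizer(analyzer='char_wb', ngram_range=(1, 3))
X_char = char_vect.transform(samples)
print(X_char.nnz)
```

With word tokens, each string contributes only a handful of non-zero entries, since a value like a phone number splits into just a few tokens.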

In [6]:
hash_vect = HashingVectorizer()
jl.dump(hash_vect, 'vectorizer.joblib')

['vectorizer.joblib']

The next step is to extract our dependent and independent variables from the dataframe. Our independent variable (X) will be the 'value' column and our dependent variable (y) will be the 'category' column. However, since the independent variable has to be numerical, we use our vectorizer to turn the strings into vectors.

In [7]:
df['value'] = df['value'].astype(str)

In [8]:
X = hash_vect.fit_transform(df['value'])
y = df['category']

After extracting our variables we can create a test/train split to build our model.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
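
The split above is drawn at random on every run. A sketch of a reproducible, class-balanced variant follows, assuming a fixed `random_state` (0 is an arbitrary choice) and `stratify` on the labels. The toy values and labels are made up for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up data: 12 values, evenly split between two categories.
values = [f"value_{i}" for i in range(12)]
labels = ['email'] * 6 + ['ssn'] * 6

# random_state pins the shuffle; stratify keeps the category mix
# the same in the train and test portions.
X_tr, X_te, y_tr, y_te = train_test_split(
    values, labels, test_size=0.33, random_state=0, stratify=labels
)
print(Counter(y_tr), Counter(y_te))
```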

Based on scikit-learn's machine learning cheat sheet, we decided that a LinearSVC would most likely be the best model for this problem.

So now we can create the model and train it using our training data.

In [30]:
clf = LinearSVC()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.8484848484848485

In [32]:
jl.dump(clf, 'model.joblib')

['model.joblib']
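
For completeness, this is a sketch of how a dumped model could be loaded back with `joblib.load` and checked against the in-memory one. It trains a tiny throwaway classifier on four values from the training data and writes to a temporary directory, so it does not touch the `model.joblib` file saved above. The names `clf_demo` and `restored` are assumptions for this example.

```python
import os
import tempfile

import joblib as jl
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import LinearSVC

vect = HashingVectorizer()
values = ['jmish0@sbwire.com', 'fhamer1@washingtonpost.com', 'Female', 'Male']
labels = ['email', 'email', 'gender', 'gender']

# Throwaway model, trained only to demonstrate the round trip.
clf_demo = LinearSVC().fit(vect.transform(values), labels)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'model.joblib')
    jl.dump(clf_demo, path)
    restored = jl.load(path)

# The restored model should reproduce the original predictions.
before = list(clf_demo.predict(vect.transform(values)))
after = list(restored.predict(vect.transform(values)))
print(before == after)
```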

Here we can see that our model performs pretty well out of the box, with an accuracy of around 85%. However, we haven't done any hyperparameter tuning yet. In later sprints, we will refine this model using cross-validation, grid search, and similar techniques.
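
As a rough idea of what that refinement could look like, the sketch below wraps the vectorizer and the classifier in a scikit-learn `Pipeline` and searches over the `C` parameter of `LinearSVC` with `GridSearchCV`. The synthetic emails and phone numbers, the candidate values for `C`, and the choice of three folds are all assumptions made for illustration, not results or settings from this project.

```python
import random

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in data: 30 emails and 30 phone-like strings.
rng = random.Random(0)
values, labels = [], []
for i in range(30):
    values.append(f"user{i}@example.com")
    labels.append('email')
    values.append(f"{rng.randint(100, 999)}-{rng.randint(100, 999)}-{rng.randint(1000, 9999)}")
    labels.append('phone_number')

pipe = Pipeline([
    ('vect', HashingVectorizer()),
    ('clf', LinearSVC()),
])

# Cross-validated search over the regularisation strength of the SVM.
search = GridSearchCV(pipe, {'clf__C': [0.1, 1.0, 10.0]}, cv=3)
search.fit(values, labels)
print(search.best_params_, search.best_score_)
```

Keeping the vectorizer inside the pipeline means its settings can be added to the same grid later.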

## Building Classification Objects ##

When we are given new data, we will perform the same sort of data massaging as we did when building our model. However, we will use the model we built earlier to create a new column in the massaged dataframe. This column will include the 'guess' that our model made for that value.

In [31]:
df = pd.read_csv('test_data.csv')
df_orig = df
df = df.transpose()
df['category'] = df.index
df = df.drop('id').melt(id_vars=['category']).drop('variable', axis=1)
df.head()

FileNotFoundError: File b'test_data.csv' does not exist

<h4>NOTE: This dataset has different column names for fname and lname</h4>

In [154]:
# HashingVectorizer is stateless, so transform() is all we need at prediction time.
df['guess'] = clf.predict(hash_vect.transform(df['value']))

In [155]:
df.head(10)

Unnamed: 0,category,value,guess
0,firstname,Giacomo,lname
1,lastname,Pulford,lname
2,email,gpulford0@wsj.com,email
3,gender,Male,gender
4,ip_address,74.141.110.4,ip_address
5,phone,934-562-0496,phone
6,address,6144 Bonner Pass,address
7,ssn,446-52-7673,money
8,firstname,Virge,fname
9,lastname,Le Grove,lname


Using this new column, we will need to calculate some sort of 'precision' score for each 'category' in the original dataset. Using this precision score, we can determine whether a 'category' belongs to one of the categories that our model can predict. If a category's 'precision' score is too low, then we can consider it as being some sort of new category of data that we haven't encountered yet.

Using this new data frame we can group the guesses together to present our findings to the user.
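
One way such a grouping could be sketched is shown here, assuming we already know the top guess for each column. The helper `group_columns_by_guess` and its two dictionary arguments are hypothetical; the sample inputs reuse a few columns and values from the test data.

```python
import json
from collections import defaultdict

def group_columns_by_guess(top_guess, examples):
    """Group columns under the category the model guessed for them.

    top_guess: {column name: guessed category}
    examples:  {column name: list of sample values}
    """
    grouped = defaultdict(lambda: {'columns': [], 'examples': []})
    for column, guess in top_guess.items():
        grouped[guess]['columns'].append(column)
        grouped[guess]['examples'].extend(examples.get(column, []))
    return [{'name': name, **body} for name, body in grouped.items()]

report = group_columns_by_guess(
    {'firstname': 'lname', 'lastname': 'lname', 'email': 'email'},
    {'firstname': ['Giacomo'], 'lastname': ['Le Grove'], 'email': ['gpulford0@wsj.com']},
)
print(json.dumps(report, indent=4))
```

Serialising with `json.dumps` leaves quoting and escaping to the standard library.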

In [156]:
df.head()

Unnamed: 0,category,value,guess
0,firstname,Giacomo,lname
1,lastname,Pulford,lname
2,email,gpulford0@wsj.com,email
3,gender,Male,gender
4,ip_address,74.141.110.4,ip_address


We want to build a frequency matrix that counts, for each original column, how often the model guessed each category.

In [157]:
result = {}
for t in df.itertuples():
    actual = t.category
    guess = t.guess

    # result[actual][guess] counts how often `guess` was predicted for column `actual`.
    counts = result.setdefault(actual, {})
    counts[guess] = counts.get(guess, 0) + 1

result

{'firstname': {'lname': 853, 'fname': 143, 'address': 1, 'credit_card': 3},
 'lastname': {'lname': 989,
  'fname': 5,
  'email': 4,
  'ssn': 1,
  'credit_card': 1},
 'email': {'email': 1000},
 'gender': {'gender': 1000},
 'ip_address': {'ip_address': 956, 'money': 44},
 'phone': {'phone': 838, 'ssn': 129, 'ip_address': 26, 'lname': 7},
 'address': {'address': 1000},
 'ssn': {'money': 396,
  'ssn': 433,
  'ip_address': 143,
  'phone': 24,
  'fname': 1,
  'lname': 2,
  'email': 1}}

In [158]:
res_df = pd.DataFrame(result).fillna(0)

In [159]:
res_df.apply(max)

firstname      853.0
lastname       989.0
email         1000.0
gender        1000.0
ip_address     956.0
phone          838.0
address       1000.0
ssn            433.0
dtype: float64

In [160]:
res_df

Unnamed: 0,firstname,lastname,email,gender,ip_address,phone,address,ssn
address,1.0,0.0,0.0,0.0,0.0,0.0,1000.0,0.0
credit_card,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
email,0.0,4.0,1000.0,0.0,0.0,0.0,0.0,1.0
fname,143.0,5.0,0.0,0.0,0.0,0.0,0.0,1.0
gender,0.0,0.0,0.0,1000.0,0.0,0.0,0.0,0.0
ip_address,0.0,0.0,0.0,0.0,956.0,26.0,0.0,143.0
lname,853.0,989.0,0.0,0.0,0.0,7.0,0.0,2.0
money,0.0,0.0,0.0,0.0,44.0,0.0,0.0,396.0
phone,0.0,0.0,0.0,0.0,0.0,838.0,0.0,24.0
ssn,0.0,1.0,0.0,0.0,0.0,129.0,0.0,433.0


In [161]:
freq = res_df.apply(max)/res_df.apply(sum)
freq

firstname     0.853
lastname      0.989
email         1.000
gender        1.000
ip_address    0.956
phone         0.838
address       1.000
ssn           0.433
dtype: float64
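
The counting loop and the ratio above can also be condensed with `pd.crosstab`. The sketch below does this on a six-row toy frame and then flags columns whose top guess falls under a threshold. The toy rows are made up, and 0.75 is used as the cut-off only because it is the value applied in the next cell.

```python
import pandas as pd

# Made-up guesses for two columns.
toy = pd.DataFrame({
    'category': ['ssn', 'ssn', 'ssn', 'ssn', 'email', 'email'],
    'guess':    ['ssn', 'money', 'ssn', 'ip_address', 'email', 'email'],
})

# Rows are guesses, columns are the original column names.
counts = pd.crosstab(toy['guess'], toy['category'])

# Share of each column's values that received the most common guess.
top_share = counts.max() / counts.sum()

# Columns below the threshold are candidates for an unseen category.
uncertain = top_share[top_share < 0.75].index.tolist()
print(top_share)
print(uncertain)
```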

In [182]:
class Classification:
    """A predicted category, the columns mapped to it, and sample values."""

    def __init__(self, name):
        self.name = name
        self.examples = []
        self.columns = []

    def add_column(self, column):
        self.columns.append(column)

    def add_example(self, example):
        self.examples.append(example)

    def to_json(self):
        # Hand-built JSON: the list reprs are patched by swapping ' for ",
        # which breaks on values that themselves contain an apostrophe.
        return f"""\
{{
    "name": "{self.name}",
    "columns": {self.columns},
    "examples": {self.examples}
}}

            """.replace("'", '"')

    def __eq__(self, other):
        # Two classifications are equal when they share a name.
        return isinstance(other, self.__class__) and self.name == other.name

    def __str__(self):
        return f"""Name: {self.name}
            Columns: {self.columns}
            Examples: {self.examples}
        """
        
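# Map each original column to its top guess; below the 0.75 threshold,
# keep the runner-up guess as well.
classification_map = {}
for k, v in freq.items():
    if v >= 0.75:
        classification_map[k] = [res_df[k].idxmax()]
    else:
        classification_map[k] = [res_df[k].idxmax(), res_df[k].drop(res_df[k].idxmax()).idxmax()]

# Build one Classification per distinct guess, with up to five example values.
# Note: when a guess already has a Classification, the column is skipped,
# so 'lastname' is not added to the existing 'lname' entry.
classifications = []
for k, v in classification_map.items():
    for c in v:
        classification = Classification(c)
        if classification not in classifications:
            classification.add_column(k)
            for ex in df[df['category'] == k]['value'].head():
                classification.add_example(ex)
            classifications.append(classification)
for cls in classifications:
    print(cls.to_json())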


{
    "name": "lname",
    "columns": ["firstname"],
    "examples": ["Giacomo", "Virge", "Fielding", "Zebadiah", "Chrotoem"]
}

            
{
    "name": "email",
    "columns": ["email"],
    "examples": ["gpulford0@wsj.com", "vlegrove1@reference.com", "fbaiyle2@imageshack.us", "zrichardot3@spotify.com", "coleszkiewicz4@vkontakte.ru"]
}

            
{
    "name": "gender",
    "columns": ["gender"],
    "examples": ["Male", "Male", "Male", "Male", "Male"]
}

            
{
    "name": "ip_address",
    "columns": ["ip_address"],
    "examples": ["74.141.110.4", "220.235.6.159", "222.205.231.178", "9.90.228.199", "35.76.198.142"]
}

            
{
    "name": "phone",
    "columns": ["phone"],
    "examples": ["934-562-0496", "736-696-7582", "149-779-8128", "383-641-4571", "215-835-7270"]
}

            
{
    "name": "address",
    "columns": ["address"],
    "examples": ["6144 Bonner Pass", "223 Loftsgordon Plaza", "483 Oneill Place", "99485 Paget Parkway", "2142 Anzinger Plaza"]
