# Using Reddit's API for Predicting Comments

In this project, we will practice two major skills: collecting data via an API request and then building a binary predictor.

The task has three stages:
- Scraping Data from Reddit
- NLP Processing for our unstructured data
- Modeling & Predictions



### 1. Scraping Data from Reddit


Our method for acquiring the data will be scraping the 'hot' threads as listed on the Reddit homepage.
We will scrape four features for each thread, which are as follows.

1. The title of the thread
2. The subreddit that the thread corresponds to
3. The length of time it has been up on Reddit
4. The number of comments on the thread

In [51]:
# Importing required packages
import requests
import pandas as pd
import webbrowser
import os
import time
import pickle


LAST_CHECK_POINT_PATH = 'data/last_checkpoint.pck'
LAST_CHECK_POINT = None
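try:
    with open(LAST_CHECK_POINT_PATH, 'rb') as fp:
        LAST_CHECK_POINT = pickle.load(fp)
except (OSError, EOFError, pickle.UnpicklingError):
    # No usable checkpoint yet (missing, empty, or corrupt file): start from the first page
    LAST_CHECK_POINT = None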

print("Last checkpoint was -> "+ str(LAST_CHECK_POINT))

Last checkpoint was -> t3_9ffj5f


Here we try to load the last checkpoint.
The checkpoint is the `after` value returned with the most recently fetched page (for example `t3_9ffj5f`), called 'THE_AFTER_FROM_STEP_1' below. We keep it because gathering more data means repeating these steps.

1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/hot.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 
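The three steps above can be sketched as a small loop. This is only an illustration under stated assumptions: `fetch_page` is a hypothetical stand-in for the HTTP request (here it reads from an in-memory dictionary so the sketch runs without network access), and `collect_posts` and `FAKE_PAGES` are names introduced for this example.

```python
def collect_posts(fetch_page, max_pages):
    """Follow the `after` cursor until max_pages pages are read or the listing ends."""
    posts = []
    after = None  # None means "start from the first page"
    for _ in range(max_pages):
        listing = fetch_page(after)
        posts.extend(child["data"] for child in listing["data"]["children"])
        after = listing["data"]["after"]  # step 1: name of the last post
        if after is None:                 # no further pages to request
            break
    return posts, after


# Hypothetical two-page listing shaped like the JSON returned by hot.json
FAKE_PAGES = {
    None: {"data": {"children": [{"data": {"title": "first"}}], "after": "t3_aaa"}},
    "t3_aaa": {"data": {"children": [{"data": {"title": "second"}}], "after": None}},
}

posts, last_after = collect_posts(lambda after: FAKE_PAGES[after], max_pages=5)
print([post["title"] for post in posts], last_after)
```

Passing the fetch function in as an argument keeps the loop testable without touching the network; in the notebook that role is played by `getData`.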


In [52]:
def nextUrl(lastpage):
    if lastpage is None:
        return "http://www.reddit.com/hot.json"
    else:
        return 'http://www.reddit.com/hot.json?after=' + lastpage

This function builds the next Reddit URL to request from the last checkpoint.

In [53]:
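def getData(lastpage):
    # lastpage is the `after` token of the previous page, or None for the first page
    scrap_url = nextUrl(lastpage)
    print('Connecting to reddit... ' + scrap_url)
    res = requests.get(scrap_url, headers={'User-agent': 'iView Labs Pvt. Ltd. 0.1'})
    data = res.json()
    # Token that identifies the next page; it becomes the new checkpoint
    next_checkpoint = data['data']['after']
    return data, next_checkpoint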

Next, a function that returns the JSON response for the requested page together with the `after` token from that response, which acts as the new 'LAST_CHECK_POINT' in our code.

In [54]:
def saveData(data):
    if data is None:
        print("Unable to read response...")
    else:
        # Collect the four features of every post on this page
        children = data['data']['children']
        rows = []
        for child in children:
            post = child['data']
            rows.append({'title': post['title'], 'subreddit': post['subreddit'],
                         'created': post['created'], 'num_comments': post['num_comments']})
        df = pd.DataFrame(rows, columns=['title', 'subreddit', 'created', 'num_comments'])

        # Load what was saved on earlier runs, if anything
        last_save_df = None
        try:
            last_save_df = pd.read_csv('data/reddit_data.csv')
        except (OSError, pd.errors.EmptyDataError):
            pass

        if last_save_df is not None:
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
            appended = pd.concat([last_save_df, df], ignore_index=True)
            appended.to_csv('data/reddit_data.csv', index=False)
        else:
            df.to_csv('data/reddit_data.csv', index=False)

After getting the data, we save our four features to a CSV file, so this function maps each post in the response to one row and appends the rows to the file.
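The mapping from the response to rows can be shown on a tiny example. The following is a minimal sketch, assuming the listing has the `data` / `children` / `data` nesting used above; `listing_to_frame`, `FEATURES`, and the sample post are made up for illustration.

```python
import pandas as pd

FEATURES = ["title", "subreddit", "created", "num_comments"]


def listing_to_frame(listing):
    """Turn one page of listing JSON into a DataFrame with the four features."""
    rows = [
        {name: child["data"][name] for name in FEATURES}
        for child in listing["data"]["children"]
    ]
    return pd.DataFrame(rows, columns=FEATURES)


# Hypothetical one-post page; extra keys such as "score" are ignored
sample = {"data": {"children": [
    {"data": {"title": "Cat.", "subreddit": "aww", "created": 1536308000.0,
              "num_comments": 12, "score": 99}},
]}}

frame = listing_to_frame(sample)
print(frame)
```

Building the list of rows first and creating the DataFrame once avoids growing the frame one row at a time.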

In [56]:
def loadData(page=1, last_point=None):
    if page > 1:
        for num in range(page):
            data, last_point = getData(last_point)
            saveData(data)

            # Only overwrite the checkpoint file when there is a new token to store
            if last_point is not None:
                with open(LAST_CHECK_POINT_PATH, 'wb') as fp:
                    pickle.dump(last_point, fp)
                print('Checkpoint saved... ' + last_point)

            print('Waiting for 3 seconds to proceed another request...')
            time.sleep(3)  # sleeps 3 seconds before continuing
    else:
        # A single page always starts from the top of the listing
        data, last_point = getData(None)
        saveData(data)
    return last_point

With the helper functions in place, we need one driver function that loads Reddit data for a given number of pages.

In this function we call time.sleep(3) to wait 3 seconds before requesting the next URL.
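The driver also records a checkpoint after every page. A minimal sketch of saving and restoring such a checkpoint is given here, assuming the token is a plain string; `save_checkpoint` and `load_checkpoint` are hypothetical helper names, and the temporary directory only keeps the example self-contained.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, token):
    """Write the token to a temporary file, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as fp:
        pickle.dump(token, fp)
    os.replace(tmp_path, path)  # replaces the old file in one step


def load_checkpoint(path):
    """Return the stored token, or None when no readable checkpoint exists."""
    try:
        with open(path, "rb") as fp:
            return pickle.load(fp)
    except (OSError, EOFError, pickle.UnpicklingError):
        return None


with tempfile.TemporaryDirectory() as folder:
    checkpoint_path = os.path.join(folder, "last_checkpoint.pck")
    before = load_checkpoint(checkpoint_path)   # nothing saved yet
    save_checkpoint(checkpoint_path, "t3_9ffj5f")
    after = load_checkpoint(checkpoint_path)

print(before, after)
```

Writing to a temporary file first means an interrupted run leaves the previous checkpoint untouched.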

In [57]:
# Uncomment the last line of this cell to fetch more data
num_pages = 10
# LAST_CHECK_POINT = loadData(num_pages, LAST_CHECK_POINT)

In [58]:
finalDf = pd.read_csv('data/reddit_data.csv')

Finally, we print the scraped data.

In [60]:
finalDf
df = finalDf
print(finalDf)

                                                  title  \
0         Both our dogs are smitten with the neighbour.   
1     Old people always tryna shit on the younger ge...   
2     Last year 920,000 children died of pneumonia, ...   
3                             Robbing a vape shop, wcgw   
4     Justin Trudeau to apologize Nov. 7 for 1939 de...   
5     4 whales swimming silently underneath this guy...   
6                          It really did turn out well😀   
7                              I understand completely.   
8     Dead Man's Cove at Cape Disappointment, WA [20...   
9     TIL the Celebrity Jeopardy sketch from SNL was...   
10                                     It's just a game   
11    My friend works at the car wash and this was t...   
12    This Opuntia ‘Pinta Rita’ cactus looks opalescent   
13                     I wish all fathers were like him   
14    Julia Louis-Dreyfus, 6 months after completing...   
15              Update on the goat, he’s in pajamas now 

##### Creating a feature that flags whether "cat" or "funny" appears in the title

In [79]:
wordDf= pd.DataFrame(columns=['word'])

def is_cat_funny_in_title(text):
    if "cat" in text.lower():
        return 'cat'
    elif "funny" in text.lower():
        return 'funny'
    else:
        return 'na'

wordDf['word'] = df['title'].apply(lambda x: is_cat_funny_in_title(x))
df[wordDf['word'].str.contains("cat")]['title']

56            My brother and my cat have matching outfits
136                                      Thor on vacation
171     Not sure what to do about the cat. Keeps bring...
325                               Existential cat crisis.
383                                                  Cat.
397          Caught my grumpy cat loving on the new puppy
457     Trying to work with cats around can be challen...
637     My mum is knitting donkeys to raise money for ...
712          Caught my grumpy cat loving on the new puppy
846     Whoever decided on this location for the lambd...
874                  I swear there’s a cat in this photo.
877     Off to Houston to stand with Sam Young during ...
890     My cat got into a bathroom drawer and pushed i...
910     Heading to Houston to support Sam Young. The M...
1078    TIL according to Taika Waititi, 80% of the dia...
1100    Dad's shop cat had kittens. This one keeps fol...
1156                           Finland's education system
1157    These 
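The substring test `"cat" in text` also matches words that merely contain those letters, which is why titles such as "Thor on vacation" and "Finland's education system" appear in the list above. A minimal sketch of a stricter check follows, assuming whole-word matches (with an optional plural "s") are what is wanted; `has_word` is a name introduced for this example.

```python
import re


def has_word(text, word):
    """True when `word` occurs as a whole word (optionally plural) in `text`."""
    pattern = r"\b" + re.escape(word) + r"s?\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


titles = ["Existential cat crisis.", "Thor on vacation", "Trying to work with cats around"]
for title in titles:
    print(title, "->", has_word(title, "cat"))
```

The `\b` anchors require a word boundary on both sides, so "vacation" and "location" no longer count as a cat.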

### 2. NLP Processing for our unstructured data

We now have our four features, but the table above shows that some of them are strings. A model cannot work with raw strings, so we have to transform this data into numerical form.

One-hot encoding works for the subreddit, but it would not give useful results for the title, since nearly every title is distinct. For the title we apply NLP instead and build one matrix of word counts across all titles.

To do NLP on the title, we follow a pipeline that extracts the significant words from each title.
The pipeline is as follows.

- REMOVE PUNCTUATION
- TOKENIZATION
- REMOVING STOPWORDS
- STEMMING (IF NEEDED TO IMPROVE ACCURACY)
- VECTORIZING

##### 2.1 Remove Punctuation

In this step, we remove all of the following punctuation characters from the title.


In [9]:
import string

print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [10]:
def remove_punc(text):
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    return text_nopunct

In [11]:
# Lowercase first: string comparison in Python is case sensitive, so 'A' and 'a' would otherwise differ
df['title_removed_punctuation'] = df['title'].apply(lambda x: remove_punc(x.lower()))
print(df.head())

                                               title           subreddit  \
0      Both our dogs are smitten with the neighbour.                 aww   
1  Old people always tryna shit on the younger ge...  BlackPeopleTwitter   
2  Last year 920,000 children died of pneumonia, ...       UpliftingNews   
3                          Robbing a vape shop, wcgw    Whatcouldgowrong   
4  Justin Trudeau to apologize Nov. 7 for 1939 de...           worldnews   

        created  num_comments  \
0  1.536308e+09           272   
1  1.536309e+09          1046   
2  1.536309e+09           208   
3  1.536314e+09           438   
4  1.536309e+09          2911   

                           title_removed_punctuation  
0       both our dogs are smitten with the neighbour  
1  old people always tryna shit on the younger ge...  
2  last year 920000 children died of pneumonia mo...  
3                           robbing a vape shop wcgw  
4  justin trudeau to apologize nov 7 for 1939 dec...  
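An equivalent way to drop punctuation is a translation table. The sketch below is an alternative, not the method used in this notebook; `strip_ascii_punct` is a name introduced here. It also illustrates a limitation shared by both approaches: `string.punctuation` contains only ASCII characters, so a typographic apostrophe such as the one in "he’s" is left in place.

```python
import string

# Translation table that deletes every ASCII punctuation character
ASCII_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_ascii_punct(text):
    """Remove ASCII punctuation; non-ASCII marks are kept."""
    return text.translate(ASCII_PUNCT_TABLE)


print(strip_ascii_punct("Robbing a vape shop, wcgw"))
print(strip_ascii_punct("Update on the goat, he’s in pajamas now"))
```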


##### 2.2 Tokenization

Tokenization is the act of breaking up a sequence of strings into pieces such as words, keywords, phrases, symbols and other elements called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenization, some characters like punctuation marks are discarded.

In [12]:
import re


def tokenize(text):
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning)
    tokens = re.split(r'\W+', text)
    return tokens

# Tokenize each title
df['title_tokenize'] = df['title_removed_punctuation'].apply(lambda x: tokenize(x))
print(df.head())

                                               title           subreddit  \
0      Both our dogs are smitten with the neighbour.                 aww   
1  Old people always tryna shit on the younger ge...  BlackPeopleTwitter   
2  Last year 920,000 children died of pneumonia, ...       UpliftingNews   
3                          Robbing a vape shop, wcgw    Whatcouldgowrong   
4  Justin Trudeau to apologize Nov. 7 for 1939 de...           worldnews   

        created  num_comments  \
0  1.536308e+09           272   
1  1.536309e+09          1046   
2  1.536309e+09           208   
3  1.536314e+09           438   
4  1.536309e+09          2911   

                           title_removed_punctuation  \
0       both our dogs are smitten with the neighbour   
1  old people always tryna shit on the younger ge...   
2  last year 920000 children died of pneumonia mo...   
3                           robbing a vape shop wcgw   
4  justin trudeau to apologize nov 7 for 1939 dec...   

       

The tokenized titles are now in the "title_tokenize" column.
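One detail of splitting on `\W+` is worth seeing in isolation: when the text starts or ends with a non-word character (an emoji, for instance), `re.split` returns an empty string at that edge. This is a likely source of the empty token in rows such as `[really, turn, well, ]` in the outputs further down. A small sketch comparing it with `re.findall`, which returns only the word runs:

```python
import re

text = "it really did turn out well😀"

split_tokens = re.split(r"\W+", text)    # leaves an empty string after the emoji
found_tokens = re.findall(r"\w+", text)  # returns only the runs of word characters

print(split_tokens)
print(found_tokens)
```

Either filtering out empty strings or switching to `re.findall` would keep such tokens out of the vocabulary.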

##### 2.3 Remove Stopwords

Stopwords are common words that carry little meaning in a sentence.
For example, in "I am going to watch movie", the words 'I', 'am' and 'to' are stopwords.

Here we will use the nltk package to remove stopwords.

In [13]:
import nltk
print('NLTK (DOWNLOAD ALL PACKAGES TO PERFORM NLP OPERATION)')

print('UNCOMMENT FOLLOWING LINE To GET NLTK DOWNLOADED')
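# With no arguments this opens the interactive downloader;
# nltk.download('stopwords') would fetch only the stopword list
nltk.download()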
stopword = nltk.corpus.stopwords.words('english')


def remove_stopwords(tokenized_list):
    text = [word for word in tokenized_list if word not in stopword]
    return text

NLTK (DOWNLOAD ALL PACKAGES TO PERFORM NLP OPERATION)
UNCOMMENT FOLLOWING LINE To GET NLTK DOWNLOADED
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


In [14]:
df['title_nostopwords'] = df['title_tokenize'].apply(lambda x: remove_stopwords(x))
print(df.head())

                                               title           subreddit  \
0      Both our dogs are smitten with the neighbour.                 aww   
1  Old people always tryna shit on the younger ge...  BlackPeopleTwitter   
2  Last year 920,000 children died of pneumonia, ...       UpliftingNews   
3                          Robbing a vape shop, wcgw    Whatcouldgowrong   
4  Justin Trudeau to apologize Nov. 7 for 1939 de...           worldnews   

        created  num_comments  \
0  1.536308e+09           272   
1  1.536309e+09          1046   
2  1.536309e+09           208   
3  1.536314e+09           438   
4  1.536309e+09          2911   

                           title_removed_punctuation  \
0       both our dogs are smitten with the neighbour   
1  old people always tryna shit on the younger ge...   
2  last year 920000 children died of pneumonia mo...   
3                           robbing a vape shop wcgw   
4  justin trudeau to apologize nov 7 for 1939 dec...   

       

In the first row of the 'title_tokenize' column, 'both', 'our', 'are', 'with' and 'the' are stopwords that carry little meaning in the sentence, so we remove them and build a new column named 'title_nostopwords'.
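The filtering step can be tried without downloading anything. The sketch below uses a tiny hand-written stopword set as an assumption (NLTK's English list is much longer) and a hypothetical helper `drop_stopwords`; a set is used because membership tests on a set do not slow down as the list of stopwords grows.

```python
# Tiny illustrative stopword set; NLTK's English list is much longer
STOPWORDS = frozenset({"both", "our", "are", "with", "the", "i", "am", "to"})


def drop_stopwords(tokens, stopwords=STOPWORDS):
    """Keep the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]


tokens = ["both", "our", "dogs", "are", "smitten", "with", "the", "neighbour"]
print(drop_stopwords(tokens))
```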

Now we will create one function that performs all three steps in one go.

In [15]:
def clean_text(text):
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text_nopunct)
    text = [word for word in tokens if word not in stopword]
    return text

df['title_clear'] = df['title'].apply(lambda x: clean_text(x.lower()))

#Removing unnecessary columns
del df['title_tokenize'], df['title_removed_punctuation']
df

Unnamed: 0,title,subreddit,created,num_comments,title_nostopwords,title_clear
0,Both our dogs are smitten with the neighbour.,aww,1.536308e+09,272,"[dogs, smitten, neighbour]","[dogs, smitten, neighbour]"
1,Old people always tryna shit on the younger ge...,BlackPeopleTwitter,1.536309e+09,1046,"[old, people, always, tryna, shit, younger, ge...","[old, people, always, tryna, shit, younger, ge..."
2,"Last year 920,000 children died of pneumonia, ...",UpliftingNews,1.536309e+09,208,"[last, year, 920000, children, died, pneumonia...","[last, year, 920000, children, died, pneumonia..."
3,"Robbing a vape shop, wcgw",Whatcouldgowrong,1.536314e+09,438,"[robbing, vape, shop, wcgw]","[robbing, vape, shop, wcgw]"
4,Justin Trudeau to apologize Nov. 7 for 1939 de...,worldnews,1.536309e+09,2911,"[justin, trudeau, apologize, nov, 7, 1939, dec...","[justin, trudeau, apologize, nov, 7, 1939, dec..."
5,4 whales swimming silently underneath this guy...,interestingasfuck,1.536303e+09,866,"[4, whales, swimming, silently, underneath, gu...","[4, whales, swimming, silently, underneath, gu..."
6,It really did turn out well😀,nonononoyes,1.536315e+09,73,"[really, turn, well, ]","[really, turn, well, ]"
7,I understand completely.,CrappyDesign,1.536310e+09,177,"[understand, completely]","[understand, completely]"
8,"Dead Man's Cove at Cape Disappointment, WA [20...",EarthPorn,1.536304e+09,285,"[dead, mans, cove, cape, disappointment, wa, 2...","[dead, mans, cove, cape, disappointment, wa, 2..."
9,TIL the Celebrity Jeopardy sketch from SNL was...,todayilearned,1.536302e+09,922,"[til, celebrity, jeopardy, sketch, snl, create...","[til, celebrity, jeopardy, sketch, snl, create..."


The clean_text function produces the same column, so we remove the 'title_nostopwords' column.

In [16]:
del df['title_nostopwords']
df

Unnamed: 0,title,subreddit,created,num_comments,title_clear
0,Both our dogs are smitten with the neighbour.,aww,1.536308e+09,272,"[dogs, smitten, neighbour]"
1,Old people always tryna shit on the younger ge...,BlackPeopleTwitter,1.536309e+09,1046,"[old, people, always, tryna, shit, younger, ge..."
2,"Last year 920,000 children died of pneumonia, ...",UpliftingNews,1.536309e+09,208,"[last, year, 920000, children, died, pneumonia..."
3,"Robbing a vape shop, wcgw",Whatcouldgowrong,1.536314e+09,438,"[robbing, vape, shop, wcgw]"
4,Justin Trudeau to apologize Nov. 7 for 1939 de...,worldnews,1.536309e+09,2911,"[justin, trudeau, apologize, nov, 7, 1939, dec..."
5,4 whales swimming silently underneath this guy...,interestingasfuck,1.536303e+09,866,"[4, whales, swimming, silently, underneath, gu..."
6,It really did turn out well😀,nonononoyes,1.536315e+09,73,"[really, turn, well, ]"
7,I understand completely.,CrappyDesign,1.536310e+09,177,"[understand, completely]"
8,"Dead Man's Cove at Cape Disappointment, WA [20...",EarthPorn,1.536304e+09,285,"[dead, mans, cove, cape, disappointment, wa, 2..."
9,TIL the Celebrity Jeopardy sketch from SNL was...,todayilearned,1.536302e+09,922,"[til, celebrity, jeopardy, sketch, snl, create..."


##### 2.4 Vectorizing

Now we turn the tokenized words into count vectors so that they can be combined with our other features and used to train a classification model.


In [17]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(analyzer=clean_text)
X_counts = count_vect.fit_transform(df['title'])

X_counts.shape
# count_vect.get_feature_names()

(1250, 4170)

So we have 4170 unique tokens across these 1250 rows.
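To make the shape of that matrix concrete, here is a plain-Python illustration of a count matrix: one row per title, one column per distinct token. It is not how `CountVectorizer` is implemented (which returns a sparse matrix); `count_matrix` and the two sample documents are assumptions made for this sketch.

```python
from collections import Counter


def count_matrix(token_lists):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in document i."""
    vocabulary = sorted({token for tokens in token_lists for token in tokens})
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)  # missing tokens count as 0
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


docs = [["dogs", "smitten", "neighbour"], ["robbing", "vape", "shop", "vape"]]
vocabulary, rows = count_matrix(docs)
print(vocabulary)
for row in rows:
    print(row)
```

With two documents and six distinct tokens the result has shape (2, 6), in the same way that 1250 titles and 4170 tokens give (1250, 4170).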

In [18]:
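# Convert the sparse count matrix to a dense DataFrame
X_counts_df = pd.DataFrame(X_counts.toarray())

# Assign the token names as column names
# (get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names())
X_counts_df.columns = count_vect.get_feature_names_out()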
X_counts_df

Unnamed: 0,Unnamed: 1,0,010,08,09,09132018,0N,0VER,1,10,...,yield,yogurt,youll,younger,youre,youve,zero,zombie,zombieinducing,zoomies
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


We now have the vectorized matrix for the title column in the X_counts_df dataframe.
Next we need to join the other columns to this dataframe.

In [19]:
X_counts_df['subreddit'] = df['subreddit']
X_counts_df['created'] = df['created']
X_counts_df['num_comments'] = df['num_comments']
X_counts_df

Unnamed: 0,Unnamed: 1,0,010,08,09,09132018,0N,0VER,1,10,...,youll,younger,youre,youve,zero,zombie,zombieinducing,zoomies,subreddit,num_comments
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,aww,272
1,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,BlackPeopleTwitter,1046
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,UpliftingNews,208
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Whatcouldgowrong,438
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,worldnews,2911
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,interestingasfuck,866
6,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,nonononoyes,73
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,CrappyDesign,177
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,EarthPorn,285
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,todayilearned,922
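One thing to watch when adding columns by name: if a title token happens to equal a column name (the word "created", for example), the assignment overwrites that token's count column instead of adding a new one. The frame shown above ends with `subreddit` and `num_comments` and no separate `created` column, which would be consistent with that, although the truncated display does not confirm it. A minimal sketch of one way to avoid such collisions, assuming a `tok_` prefix is acceptable for the token columns (the prefix and the sample frames are made up for illustration):

```python
import pandas as pd

# Hypothetical token counts where one token happens to be the word "created"
token_counts = pd.DataFrame({"created": [1, 0], "dogs": [0, 1]})
metadata = pd.DataFrame({"created": [1.536308e9, 1.536309e9],
                         "num_comments": [272, 1046]})

# Prefix the token columns so they cannot collide with metadata column names
combined = pd.concat([token_counts.add_prefix("tok_"), metadata], axis=1)
print(list(combined.columns))
```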


One-hot encoding the <b>subreddit</b> column

In [20]:
finalDf = pd.get_dummies(X_counts_df, columns=['subreddit'])
finalDf

Unnamed: 0,Unnamed: 1,0,010,08,09,09132018,0N,0VER,1,10,...,subreddit_wholesomegreentext,subreddit_wholesomememes,subreddit_woahdude,subreddit_woof_irl,subreddit_woooosh,subreddit_worldnews,subreddit_xboxone,subreddit_yesyesyesyesno,subreddit_youseeingthisshit,subreddit_zelda
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
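What one-hot encoding produces can be seen on three values. This is a plain-Python illustration of the idea, not the `pd.get_dummies` implementation; `one_hot` is a hypothetical helper, and the `prefix_value` column naming is chosen to mirror the column names in the output above.

```python
def one_hot(values, prefix):
    """Return (column_names, rows) with one 0/1 column per distinct value."""
    categories = sorted(set(values))
    columns = [prefix + "_" + category for category in categories]
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return columns, rows


columns, rows = one_hot(["aww", "worldnews", "aww"], prefix="subreddit")
print(columns)
print(rows)
```

Each row contains exactly one 1, in the column of its own subreddit, so the number of new columns equals the number of distinct subreddits.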


The NLP portion is now done, and we have our vectorized feature dataset.

### 3. Modeling & Predictions

In [21]:
print(finalDf.shape)

# Getting Description
print(finalDf['num_comments'].describe())

print('Num of null in label: {}'.format(finalDf['num_comments'].isnull().sum()))

finalDf['num_comments'].head()

(1250, 4676)
count    1250.000000
mean      257.849600
std       596.880617
min         0.000000
25%        25.000000
50%        76.000000
75%       228.750000
max      7253.000000
Name: num_comments, dtype: float64
Num of null in label: 0


0     272
1    1046
2     208
3     438
4    2911
Name: num_comments, dtype: int64

The mean is about 258 comments per post, so we will transform num_comments to 0 or 1 based on this average.

If the number of comments is greater than the average, the label is <b>(HIGH) 1</b>; otherwise it is <b>(LOW) 0</b>.

In [22]:
import numpy as np
avg = np.mean(finalDf['num_comments'])
def format_cmnts(num_cmnts, avg):
    if num_cmnts > avg:
        return 1 
    else: 
        return 0

finalDf['num_comments'] = finalDf['num_comments'].apply(lambda x: format_cmnts(x, avg))
finalDf['num_comments'].head()

0    1
1    1
2    0
3    1
4    1
Name: num_comments, dtype: int64
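The comment counts are right-skewed: the `describe()` output above shows a median of 76 against a mean of about 258, and even the 75th percentile (228.75) lies below the mean. Thresholding at the mean therefore labels at most about a quarter of the posts as HIGH, so a model that always predicts LOW would already be right roughly three times out of four. The sketch below shows the effect on made-up counts (`counts` and `positive_share` are assumptions for illustration) and compares it with a median threshold.

```python
from statistics import mean, median


def positive_share(counts, threshold):
    """Fraction of posts whose comment count is strictly above the threshold."""
    return sum(1 for count in counts if count > threshold) / len(counts)


# Hypothetical right-skewed comment counts (a few very busy threads)
counts = [0, 5, 25, 40, 76, 90, 228, 438, 1046, 2911]

print("mean threshold:  ", mean(counts), positive_share(counts, mean(counts)))
print("median threshold:", median(counts), positive_share(counts, median(counts)))
```

A median threshold gives balanced classes by construction; the mean is kept in this notebook, so the class imbalance is worth keeping in mind when reading the accuracy figures later on.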

In [23]:
# Create the X and y arrays
y = finalDf['num_comments'].values
del finalDf['num_comments']
X = finalDf.values

In [24]:
from sklearn.model_selection import train_test_split
# Split the data set in a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, shuffle= True)

In [25]:
#Create Random Forest model

from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=150)
random_forest.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [26]:
X_train.shape

(875, 4675)

In [27]:
X_test.shape

(375, 4675)

In [28]:
y_pred = random_forest.predict(X_test)
acc_random_forest = round(random_forest.score(X_train, y_train) * 100, 2)
print('Training Dataset Accuracy', acc_random_forest)
acc_random_forest_test = round(random_forest.score(X_test, y_test) * 100, 2)
print('Testing Dataset Accuracy', acc_random_forest_test)

Training Dataset Accuracy 99.89
Testing Dataset Accuracy 80.8


In [30]:
from sklearn.metrics import mean_squared_error
rmse2 = np.sqrt(mean_squared_error(y_test, y_pred))
print("rmse on testing data "+str(rmse2))

rmse on testing data 0.438178046004
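With 0/1 labels every squared error is either 0 or 1, so the mean squared error equals the error rate and the RMSE is simply the square root of one minus the accuracy. The figures above agree: the square root of 1 - 0.808 is about 0.4382. A small sketch on made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical) confirms the relationship:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical labels and predictions (two mistakes out of eight)
y_true_demo = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 1, 1, 0, 0, 0, 0]

print(rmse(y_true_demo, y_pred_demo))
print(math.sqrt(1 - accuracy(y_true_demo, y_pred_demo)))
```

For a binary classifier the RMSE therefore carries no information beyond the accuracy itself.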


Now using DecisionTreeClassifier for comparison

In [63]:
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, y_train) * 100, 2)
print('Training Dataset Accuracy', acc_decision_tree)
acc_decision_tree_test = round(decision_tree.score(X_test, y_test) * 100, 2)
print('Testing Dataset Accuracy', acc_decision_tree_test)

Training Dataset Accuracy 98.57
Testing Dataset Accuracy 83.78


In [64]:
from sklearn.metrics import mean_squared_error
rmse2 = np.sqrt(mean_squared_error(y_test, y_pred))
print("rmse on testing data "+str(rmse2))

rmse on testing data 0.40276819912


Now using LogisticRegression Model for comparison

In [65]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, y_train) * 100, 2)
print('Training Dataset Accuracy', acc_log)
acc_log_test = round(logreg.score(X_test, y_test) * 100, 2)
print('Testing Dataset Accuracy', acc_log_test)

Training Dataset Accuracy 76.1
Testing Dataset Accuracy 76.22


In [66]:
from sklearn.metrics import mean_squared_error
rmse2 = np.sqrt(mean_squared_error(y_test, y_pred))
print("rmse on testing data "+str(rmse2))

rmse on testing data 0.487624627944
