# PROJECT 3 - PART C:

## CLEANING OF KETO AND WINE PUSHSHIFT.IO CAPTURES

In this notebook, we load the dataframes captured from the keto and wine subreddits, check each one for missing values and duplicate posts, and save a cleaned copy for further analysis.

**ADDRESSING THE KETO DATAFRAME**

In [1]:
import pandas as pd
import numpy as np
import nltk
import json
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from bs4 import BeautifulSoup
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
# Import CountVectorizer and TfidfVectorizer from feature_extraction.text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
df_keto_push = pd.read_csv('keto_push_output.csv')

In [3]:
df_keto_push.shape

(7900, 10)

In [4]:
df_keto_push.head()

,Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
0,0,It's been a hot minute,I keep getting off track when it comes to keto...,keto,1580160840,bbwsoontobebw,5,1,True,2020-01-27
1,1,Type 1 Diabetes on Keto,"Hello, I’m beginning my second month of keto. ...",keto,1580160905,iWantNotToWant,10,1,True,2020-01-27
2,2,"Week 3,5 and intense cravings for bread and rice","Week 3,5 and I am having intense cravings for ...",keto,1580161118,littleboo2theboo,10,1,True,2020-01-27
3,3,12-Week Challenge at the gym and looking for t...,"Hi guys,\n\nI am looking for some advice. I di...",keto,1580161503,chow_shepard,4,1,True,2020-01-27
4,4,"Does never going ""Full Keto"" hurt my health?",Hello!\n\nI'm a active highschooler who avoids...,keto,1580161639,OrganizingChaosBrb,3,1,True,2020-01-27


In [5]:
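# 'Unnamed: 0' duplicates the row index (most likely written out when the capture was saved), so drop it.
df_keto_push.drop(columns=['Unnamed: 0'], inplace=True)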
df_keto_push.head()

,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
0,It's been a hot minute,I keep getting off track when it comes to keto...,keto,1580160840,bbwsoontobebw,5,1,True,2020-01-27
1,Type 1 Diabetes on Keto,"Hello, I’m beginning my second month of keto. ...",keto,1580160905,iWantNotToWant,10,1,True,2020-01-27
2,"Week 3,5 and intense cravings for bread and rice","Week 3,5 and I am having intense cravings for ...",keto,1580161118,littleboo2theboo,10,1,True,2020-01-27
3,12-Week Challenge at the gym and looking for t...,"Hi guys,\n\nI am looking for some advice. I di...",keto,1580161503,chow_shepard,4,1,True,2020-01-27
4,"Does never going ""Full Keto"" hurt my health?",Hello!\n\nI'm a active highschooler who avoids...,keto,1580161639,OrganizingChaosBrb,3,1,True,2020-01-27
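
The stray `Unnamed: 0` column is what pandas produces when a CSV that was saved with its row index is read back without further options. The sketch below reproduces this on a toy dataframe (hypothetical data, held in memory rather than on disk) and shows two ways to avoid the extra column. It is an illustration only, not a record of how `keto_push_output.csv` was actually written.

```python
import io
import pandas as pd

# Toy frame standing in for a capture (hypothetical data).
toy = pd.DataFrame({"title": ["a", "b"], "score": [1, 2]})

# to_csv writes the row index by default, as a column with an empty header.
csv_text = toy.to_csv()

# A plain read_csv labels that header-less column 'Unnamed: 0'.
reloaded = pd.read_csv(io.StringIO(csv_text))
print(list(reloaded.columns))

# Option 1: treat the first column as the index when reading.
via_index_col = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(via_index_col.columns))

# Option 2: do not write the index in the first place.
no_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))
print(list(no_index.columns))
```

Either option would make the explicit `drop` unnecessary; dropping the column after loading, as done here, gives the same result.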


In [6]:
df_keto_push.isnull().sum()

title           0
selftext        0
subreddit       0
created_utc     0
author          0
num_comments    0
score           0
is_self         0
timestamp       0
dtype: int64
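
A zero null count only shows that no cell is `NaN`. Reddit captures can also carry placeholder strings such as `[removed]` or `[deleted]` in `selftext`, which `isnull()` does not flag. Whether this capture contains any is not shown here, so the following is only a sketch of how one might check, using hypothetical rows and an assumed list of placeholder values.

```python
import pandas as pd

# Hypothetical rows; the placeholder strings are an assumption about the capture.
toy = pd.DataFrame({
    "title": ["Post A", "Post B", "Post C", "Post D"],
    "selftext": ["Real text", "[removed]", "   ", "[deleted]"],
})

# Assumed placeholder values; adjust to whatever the real data contains.
placeholders = {"[removed]", "[deleted]", ""}

# Strip whitespace so blank-looking strings compare equal to "".
is_placeholder = toy["selftext"].str.strip().isin(placeholders)

print(toy["selftext"].isnull().sum())  # what the null check alone reports
print(is_placeholder.sum())            # rows with no usable body text
```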

In [7]:
# Treat posts with the same title and body as duplicates and keep the first occurrence.
df_keto_push = df_keto_push.drop_duplicates(subset=['title', 'selftext'], keep='first')

In [8]:
df_keto_push.shape

(7894, 9)
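
The shapes before and after (7900 rows, then 7894) imply that six rows were removed as duplicates. To make clear what counts as a duplicate under `subset=['title', 'selftext']`, here is a minimal sketch on hypothetical rows: two rows are duplicates only when both columns match, and `keep='first'` retains the earliest one.

```python
import pandas as pd

# Hypothetical rows: only rows 0 and 1 share both title and selftext.
toy = pd.DataFrame({
    "title": ["Hello", "Hello", "Hello", "Other"],
    "selftext": ["same body", "same body", "different body", "same body"],
    "author": ["u1", "u2", "u3", "u4"],
})

subset = ["title", "selftext"]

# duplicated(keep='first') flags every repeat after the first occurrence.
n_repeats = toy.duplicated(subset=subset, keep="first").sum()

# drop_duplicates removes exactly those flagged rows.
deduped = toy.drop_duplicates(subset=subset, keep="first")

print(n_repeats)
print(len(toy) - len(deduped))
print(deduped["author"].tolist())
```

Calling `duplicated(...).sum()` before dropping is a cheap way to confirm how many rows will disappear.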

In [9]:
# Look for keto posts whose title mentions 'wine' (case-sensitive substring match).
df_keto_push[df_keto_push['title'].str.contains('wine')]

,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
2501,"A glass of wine every night, okay?",Is drinking a glass of wine every night okay? ...,keto,1577946414,asnsam,32,1,True,2020-01-02


In [10]:
df_keto_push[df_keto_push['selftext'].str.contains('wine')]

,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
305,I made Porchetta,\n https://imgur.com/gallery/qxGQogt\n\n\n\nI ...,keto,1579915462,sarotto,8,1,True,2020-01-24
346,Can eating 25g of net carbs some days really s...,Hey all!\n\n&amp;#x200B;\n\nThis is keto take ...,keto,1579974049,liverly,28,1,True,2020-01-25
424,Is potassium bicarbonate fine in ketoade?,I really dislike the taste of chlorides (both ...,keto,1579874975,tb877,7,1,True,2020-01-24
425,To the people trying to make bread/pizza/pasta...,"Why. \n\nIf you’re going to do something, comm...",keto,1579875194,scorpppppion,14,1,True,2020-01-24
450,Facing my faults,I posted here a few weeks ago re: how importan...,keto,1579894879,clair_a_dactyl,7,1,True,2020-01-24
628,After Action Report - San Francisco Weekend,OMG it is difficult to be a keto (or even just...,keto,1579713853,grilladdict,2,1,True,2020-01-22
811,I'm tired. [Week 3],"I'm a habitual eater, so my meals are more or ...",keto,1579544545,squeeeeenis,13,1,True,2020-01-20
1191,Hi! Looking for people with similar journeys!,"Hi first time poster, long time lurker.\n\nI""m...",keto,1579039454,Rhodium68,13,1,True,2020-01-14
1302,Question to Keto skiers/snowboarders: What’s y...,I’m off skiing next month to Italy and this wi...,keto,1578951429,doegrey,20,1,True,2020-01-13
1316,Alcohol and MCT oil,Is it possible to simply put a serving of MCT ...,keto,1578959851,ballsdeepinmywine,2,1,True,2020-01-13
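
Note that `str.contains('wine')` is case-sensitive and treats its argument as a regular expression, so the searches above miss `Wine` and would also match longer words that contain `wine`. The sketch below, on hypothetical titles, compares the default call with a case-insensitive variant and a whole-word variant. It is meant as an option to consider, not as the search used in this notebook.

```python
import pandas as pd

# Hypothetical titles chosen to show the differences between the variants.
toy = pd.DataFrame({
    "title": ["A glass of wine", "Red Wine and keto", "Twine crafts", "No alcohol"],
})

# Default: case-sensitive, pattern treated as a regular expression.
default_hits = toy["title"].str.contains("wine")

# case=False also matches 'Wine'; na=False treats missing values as non-matches.
any_case_hits = toy["title"].str.contains("wine", case=False, na=False)

# A word-boundary regex excludes substrings such as the one in 'Twine'.
whole_word_hits = toy["title"].str.contains(r"\bwine\b", case=False, na=False)

print(default_hits.tolist())
print(any_case_hits.tolist())
print(whole_word_hits.tolist())
```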


In [11]:
df_keto_push.shape

(7894, 9)

In [12]:
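# Save the cleaned keto dataframe. The row index is written as well, so it will
# reappear as an 'Unnamed: 0' column when this file is read back with read_csv.
df_keto_push.to_csv('keto_push_clean.csv')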

**ADDRESSING THE WINE DATAFRAME**

In [13]:
df_wine_push = pd.read_csv('wine_push_output.csv')

In [14]:
df_wine_push.shape

(2917, 10)

In [15]:
df_wine_push.head()

,Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
0,0,Looking for a summer job,"Hi,\n\nI'm a 23 year-old man from the Netherla...",wine,1580165102,Loirettoux,3,1,True,2020-01-27
1,1,California whites,"Whenever I try a Californian wine, it's always...",wine,1580170191,RaphGiroux,8,1,True,2020-01-27
2,2,Wine suggestions while in France (Provence),Will be traveling to France and spending most ...,wine,1580177480,irishmuse,4,1,True,2020-01-27
3,3,Show me the Munny Hunny | Wine Industry Report...,"If you are in the wine industry, congratulatio...",wine,1580177690,cudaeducation,0,1,True,2020-01-27
4,4,Going to Burgundy in May,"Hi, I'll be going to Burgundy during May and w...",wine,1580182920,GaanZi,5,1,True,2020-01-27


In [16]:
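# 'Unnamed: 0' duplicates the row index (most likely written out when the capture was saved), so drop it.
df_wine_push.drop(columns=['Unnamed: 0'], inplace=True)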
df_wine_push.head()

,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
0,Looking for a summer job,"Hi,\n\nI'm a 23 year-old man from the Netherla...",wine,1580165102,Loirettoux,3,1,True,2020-01-27
1,California whites,"Whenever I try a Californian wine, it's always...",wine,1580170191,RaphGiroux,8,1,True,2020-01-27
2,Wine suggestions while in France (Provence),Will be traveling to France and spending most ...,wine,1580177480,irishmuse,4,1,True,2020-01-27
3,Show me the Munny Hunny | Wine Industry Report...,"If you are in the wine industry, congratulatio...",wine,1580177690,cudaeducation,0,1,True,2020-01-27
4,Going to Burgundy in May,"Hi, I'll be going to Burgundy during May and w...",wine,1580182920,GaanZi,5,1,True,2020-01-27


In [17]:
df_wine_push.isnull().sum()

title           0
selftext        0
subreddit       0
created_utc     0
author          0
num_comments    0
score           0
is_self         0
timestamp       0
dtype: int64

In [18]:
# Treat posts with the same title and body as duplicates and keep the first occurrence.
df_wine_push = df_wine_push.drop_duplicates(subset=['title', 'selftext'], keep='first')

In [19]:
df_wine_push.shape

(2882, 9)

In [20]:
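# Save the cleaned wine dataframe. The row index is written as well, so it will
# reappear as an 'Unnamed: 0' column when this file is read back with read_csv.
df_wine_push.to_csv('wine_push_clean.csv')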
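
The keto and wine dataframes go through the same two cleaning steps, so they could share one helper. The following is a minimal sketch under that assumption. The function name `clean_capture` is hypothetical and does not appear elsewhere in this project, and the demonstration uses toy rows rather than the real captures.

```python
import pandas as pd


def clean_capture(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of one capture; the input frame is not modified."""
    # Drop the leftover index column; errors='ignore' skips it if absent.
    cleaned = df.drop(columns=["Unnamed: 0"], errors="ignore")
    # Keep the first of any posts that share both title and selftext.
    cleaned = cleaned.drop_duplicates(subset=["title", "selftext"], keep="first")
    return cleaned


# Hypothetical rows standing in for a raw capture.
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "title": ["t1", "t1", "t2"],
    "selftext": ["body", "body", "body"],
    "subreddit": ["keto", "keto", "keto"],
})

cleaned = clean_capture(raw)
print(raw.shape)
print(cleaned.shape)
```

With the real files, the same call could be applied to the result of `pd.read_csv('keto_push_output.csv')` and `pd.read_csv('wine_push_output.csv')` in turn, which keeps the two cleaning paths from drifting apart.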