# Data Cleaning, Preprocessing, EDA, and Feature Engineering

### Imports

In [1]:
import pandas as pd
import re
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
import warnings
from bs4 import BeautifulSoup  
warnings.filterwarnings("ignore")

In [2]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


## Importing Data and Tidying up the DataFrame

In [3]:
climate_skeptics = './data/climateskeptics.csv'
climate_change = './data/climatechange.csv'
skeptics = pd.read_csv(climate_skeptics)
change = pd.read_csv(climate_change)


In [4]:
skeptics.head()

Unnamed: 0.1,Unnamed: 0,post_title,post_text,subreddit
0,0,,,
1,1,2007 NASA: Arctic Ice Free By 2013!!,,climateskeptics
2,2,Climate Hype is a Cover Up,,climateskeptics
3,3,"These are the suggestions when you type in ""cl...",,climateskeptics
4,4,chernobyl miniseries is global warming propaganda,"i thought it was anti-nuclear propaganda, whic...",climateskeptics


In [5]:
change.head()

Unnamed: 0.1,Unnamed: 0,post_title,post_text,subreddit
0,0,,,
1,1,Subreddit rules,Reddit's new look doesn't display the sidebar ...,climatechange
2,2,I'm afraid climate change is going to kill me!...,Feeling scared? Have you been listening to or ...,climatechange
3,3,"As Water Scarcity Increases, Desalination Plan...",,climatechange
4,4,Scientists have developed an interactive map d...,,climatechange


In [6]:
change.shape

(1237, 4)

In [7]:
skeptics.shape

(1231, 4)

In [8]:
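#I need to drop the duplicate posts that I collected when gathering data; drop_duplicates should do that.
#Note: the leftover 'Unnamed: 0' counter column is still present here and differs on every row, so
#no rows can match at this point (the shapes below only shrink by the one placeholder row).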
skeptics.drop_duplicates(inplace=True)
change.drop_duplicates(inplace=True)

In [9]:
#My DFs have redundant/dirty columns and rows, let's get rid of them.
skeptics.drop(columns = 'Unnamed: 0', index=0, inplace=True)
change.drop(columns = 'Unnamed: 0', index=0, inplace=True)
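
One thing worth checking at this step: while the leftover `Unnamed: 0` counter column is still in the frame, every row is unique, so a plain `drop_duplicates()` has nothing to match. Below is a minimal sketch on a made-up three-row frame (the rows and the choice of `subset` columns are assumptions for illustration, not the notebook's data) comparing the plain call with one that only compares the post content.

```python
import pandas as pd

# Toy frame: rows 1 and 2 are the same post scraped twice, but the leftover
# counter column ('Unnamed: 0') differs, so the rows are not exact duplicates.
posts = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "post_title": ["Subreddit rules", "Ice free by 2013", "Ice free by 2013"],
    "post_text": ["Read the sidebar", "Link post", "Link post"],
    "subreddit": ["climatechange", "climateskeptics", "climateskeptics"],
})

# Compares every column, including the counter, so nothing is dropped.
naive = posts.drop_duplicates()

# Compares only the content columns, so the repeated post is dropped.
by_content = posts.drop_duplicates(subset=["post_title", "post_text"])

print(len(naive), len(by_content))
```

In this notebook, that could mean either dropping `Unnamed: 0` before deduplicating or passing a `subset` like the one above.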

In [10]:
#Now let's reset the indexes.
skeptics.reset_index(drop=True, inplace=True)
change.reset_index(drop=True, inplace=True)

In [11]:
change.head()

Unnamed: 0,post_title,post_text,subreddit
0,Subreddit rules,Reddit's new look doesn't display the sidebar ...,climatechange
1,I'm afraid climate change is going to kill me!...,Feeling scared? Have you been listening to or ...,climatechange
2,"As Water Scarcity Increases, Desalination Plan...",,climatechange
3,Scientists have developed an interactive map d...,,climatechange
4,"Debunking ""climate skeptics,"" part 1. There is...",Hi guys! I'm thinking about making a series of...,climatechange


In [12]:
change.dtypes

post_title    object
post_text     object
subreddit     object
dtype: object

In [13]:
skeptics.dtypes

post_title    object
post_text     object
subreddit     object
dtype: object

#### ***My dtypes are all objects, which was expected. Once I start feature engineering and tokenizing, I will binarize 'subreddit' and turn the text columns into int/float features.***

In [14]:
change.shape

(1236, 3)

In [15]:
skeptics.shape

(1230, 3)

#### ***I've got a pretty respectable number of Reddit posts here after tidying up. 2,466 rows of data (1,236 + 1,230) should be enough to generate a model that can at least perform with more accuracy than the baseline score.***

## Data Cleaning 

In [16]:
change.isnull().sum()

post_title      0
post_text     659
subreddit       0
dtype: int64

In [17]:
skeptics.isnull().sum()

post_title       0
post_text     1013
subreddit        0
dtype: int64

#### ***I've got a large number of NaNs in my 'post_text' column. This was expected. These NaNs represent Reddit posts that had no text and instead had an image or a link to another website.***

In [18]:
# I'm going to fill in my NaNs with blank text because I don't want to add any new words
# that could be counted as high-weight words and potentially
# hurt my model scores.
change.fillna(' ', inplace=True)
skeptics.fillna(' ', inplace=True)

In [19]:
#Let's call .head on our 'post_text' columns to take a closer look at them
change['post_text'].head()

0    Reddit's new look doesn't display the sidebar ...
1    Feeling scared? Have you been listening to or ...
2                                                     
3                                                     
4    Hi guys! I'm thinking about making a series of...
Name: post_text, dtype: object

In [20]:
skeptics['post_text'].head(10)

0                                                     
1                                                     
2                                                     
3    i thought it was anti-nuclear propaganda, whic...
4                                                     
5                                                     
6    https://watchers.news/2019/07/04/record-low-te...
7                                                     
8    I am a person who believes in climate change, ...
9    I'm not particularly interested in short artic...
Name: post_text, dtype: object

In [21]:
#Saving the cleaned dataframes before I merge them into one DF. 
change.to_csv("./data/climatechange_cleaned.csv", index = False)
skeptics.to_csv("./data/climateskeptics_cleaned.csv", index = False)


## Feature Engineering and Data Extraction

In [22]:
#Creating a new DataFrame, 'df', that combines our Climate Change and Climate Skeptics subreddits
data = [change, skeptics]
df = pd.concat(data)

In [23]:
#Because there aren't all that many columns in our DF there really aren't many features to be engineered. I 
#DO need to engineer my 'y' variable though, which is subreddits. So let's binarize that column.
#I will be tokenizing my text and title columns which could technically count as feature engineering,
#But that work is shown in the preprocessing notebook.

df['subreddit'] = df['subreddit'].map({'climatechange': 1, 'climateskeptics': 0})

In [24]:
# I'm going to create a new column where I combine my post_title and post_text. This could be helpful
# For feature vectorization. 
df['combined_text'] = df['post_title'] + " " + df['post_text']

In [25]:
#Now let's save and extract our cleaned and preprocessing ready DF before we start working with it.
df.to_csv("./data/df.csv", index=False)

## Preprocessing and Model Creation 

### Baseline
- Before we do anything with modeling, let's first calculate our baseline score, which is the accuracy we would get by always predicting the most common class. I want all of my models to perform better than the baseline score.

In [26]:
baseline = df['subreddit'].value_counts(normalize = True).max()

print(f"Our Baseline score is {baseline}. What this is saying is that about {np.round(baseline * 100)}%"
      + " of our data entries \ncome from the 'climatechange' subreddit")

Our Baseline score is 0.5012165450121655. What this is saying is that about 50.0% of our data entries 
come from the 'climatechange' subreddit
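
The baseline is just the share of the majority class. As a quick sanity check, here is a sketch that recomputes the same number in plain Python from the two row counts reported earlier (1,236 /r/climatechange posts and 1,230 /r/climateskeptics posts); the variable names are only illustrative.

```python
from collections import Counter

# Rebuild the label column from the row counts shown by .shape above:
# 1 = climatechange (1236 rows), 0 = climateskeptics (1230 rows).
labels = [1] * 1236 + [0] * 1230

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of a "model" that always predicts the majority class.
baseline = majority_count / len(labels)

print(majority_label, round(baseline, 4))
```

Any model worth keeping should beat this number on the test set.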


### Preprocessing and Model Creation Part 1: The Hard Way

***It's time to start all of my preprocessing. I'm going to set up a function that will be able to take in a corpus, clean it, remove stop words, then lemmatize it.***

In [27]:
#Instantiating our lemmatizer
lemmatizer = WordNetLemmatizer()

# Common English stopwords plus two words that occur extremely frequently in both
# subreddits, so they likely add noise rather than signal. Built once, outside the function.
stops = set(stopwords.words('english') + ['climate', 'change'])

#This function will manually clean any string that it receives as an input.
def clean_text(string):
    # Remove most HTML artifacts in the text (parser named explicitly to avoid bs4's warning)
    text_only = BeautifulSoup(string, 'html.parser').get_text()
    # Replace everything that isn't a letter (punctuation, numbers) with a space
    letters_only = re.sub("[^a-zA-Z]", " ", text_only)
    # Lowercase the string, then split it on whitespace
    words = letters_only.lower().split()
    # Keep only the words that aren't in our stop words
    meaningful_words = [w for w in words if w not in stops]
    # Lemmatize the meaningful words so inflected forms collapse into one feature
    lemmatized_words = [lemmatizer.lemmatize(w) for w in meaningful_words]
    # Finally, join our lemmatized words with single spaces and return the result.
    return " ".join(lemmatized_words)
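
To make the cleaning steps concrete without needing the NLTK corpora, here is a stripped-down sketch of the same idea. It is not the function above: `toy_clean` and `TOY_STOPS` are hypothetical names, the stop list is a tiny made-up one, and the HTML-stripping and lemmatizing steps are left out.

```python
import re

# Tiny illustrative stop list; the notebook uses NLTK's English list
# plus 'climate' and 'change'.
TOY_STOPS = {"the", "is", "a", "of", "climate", "change"}

def toy_clean(text):
    # Replace anything that isn't a letter with a space
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    # Lowercase and split on whitespace
    words = letters_only.lower().split()
    # Drop stop words and rejoin with single spaces
    return " ".join(w for w in words if w not in TOY_STOPS)

print(toy_clean("Climate change is REAL: 97% of the studies agree!"))
```

Note how the digits and punctuation disappear before the stop words are filtered, so "97%" never reaches the vectorizer.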

In [28]:
#Train Test Split incoming! TAKE COVER
X_train, X_test, y_train, y_test = train_test_split(df[['combined_text']], 
                                                    df['subreddit'], 
                                                    random_state = 42)                                                        

In [29]:
#Time to apply our function and create the word matrixes for our model!

clean_train_text = []
clean_test_text = []

print("Cleaning and lemmatizing X_train, give me a second...")

for combined_text in X_train['combined_text']:
    clean_train_text.append(clean_text(combined_text))
print("Done!")

print("Cleaning and lemmatizing X_test, give me a second...")

for combined_text in X_test['combined_text']:
    clean_test_text.append(clean_text(combined_text))
print("Done!")

Cleaning and lemmatizing X_train, give me a second...
Done!
Cleaning and lemmatizing X_test, give me a second...
Done!


In [30]:
#Instantiating our "tokenizer". The parameters I'm setting will mostly allow it to pull a features list
#Out of our already created lemmatized word lists.

vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 5000) 
tfidf_vectorizer = TfidfVectorizer(analyzer = "word",
                                   tokenizer = None,
                                   preprocessor = None,
                                   max_features = 5000)

train_data_features = vectorizer.fit_transform(clean_train_text)
test_data_features = vectorizer.transform(clean_test_text)

In [31]:
train_data_features = train_data_features.toarray()

In [32]:
lr1 = LogisticRegression()
lr1.fit(train_data_features, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [33]:
print(f"Train Count Vec Logistic Regression score is {lr1.score(train_data_features, y_train)}")
print(f"Test Count Vec Logistic Regression score is {lr1.score(test_data_features, y_test)}")

Train Count Vec Logistic Regression score is 0.981611681990265
Test Count Vec Logistic Regression score is 0.8152350081037277


***Ok so our initial score is better than expected, but as can be observed, our model has high variance: it scores 0.98 on the training data but only 0.82 on the test data. It is probably overfit.***

In [34]:
#Let's create a confusion matrix for this model.
y_pred = lr1.predict(test_data_features)
cm = confusion_matrix(y_test, y_pred)

In [35]:
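cm_df = pd.DataFrame(cm,
             columns=['pred /r/climateskeptics', 'pred /r/climatechange'],
             index=['actual /r/climateskeptics', 'actual /r/climatechange'])

In [36]:
cm_df

Unnamed: 0,pred /r/climateskeptics,pred /r/climatechange
actual /r/climateskeptics,258,51
actual /r/climatechange,63,245


In [37]:
#confusion_matrix orders labels ascending (0 = climateskeptics, 1 = climatechange), so this
#unpacking treats label 0 (climateskeptics) as the "positive" class in the metrics below.
tp, fn, fp, tn = cm.ravel()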

In [38]:
#Now let's take a look at all of our different classification metrics.
print("Our model's classification metric scores are as follows:")
print(f"Accuracy: {(tp+tn)/(tp+fn+fp+tn)}") 
print(f"Misclassification Rate: {(fp+fn)/(tp+fn+fp+tn)}")
print(f"Sensitivity: {(tp)/(tp+fn)}")
print(f"Specificity: {(tn)/(tn+fp)}")
print(f"Precision: {(tp)/(tp+fp)}")
    

Our model's classification metric scores are as follows:
Accuracy: 0.8152350081037277
Misclassification Rate: 0.1847649918962723
Sensitivity: 0.8349514563106796
Specificity: 0.7954545454545454
Precision: 0.8037383177570093
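
For comparison, here is a sketch of the same metrics with /r/climatechange (label 1) as the positive class instead. It uses only the four counts from the confusion matrix output above and scikit-learn's documented layout for binary labels, where the flattened matrix reads tn, fp, fn, tp. The variable names are illustrative.

```python
# Counts copied from the confusion matrix output above.
# scikit-learn sorts labels ascending, so row/column 0 is label 0 (climateskeptics)
# and row/column 1 is label 1 (climatechange); rows are actual, columns are predicted.
cm = [[258, 51],
      [63, 245]]

# With label 1 (climatechange) as the positive class, the layout is [[tn, fp], [fn, tp]].
(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # share of actual climatechange posts caught
specificity = tn / (tn + fp)   # share of actual climateskeptics posts caught
precision = tp / (tp + fp)     # share of climatechange predictions that were right

print(f"Accuracy:    {accuracy:.4f}")
print(f"Sensitivity: {sensitivity:.4f}")
print(f"Specificity: {specificity:.4f}")
print(f"Precision:   {precision:.4f}")
```

Accuracy is the same either way; sensitivity and specificity simply trade places when the positive class flips.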


### Preprocessing and Model Creation Part 2: Gridsearch and Pipelines

#### Setting up our Pipeline and Gridsearch Variables
- I want to get the best model score possible, and I have quite a few different modeling techniques that I could use. By building pipelines that pair different vectorizers with different models, I can use gridsearch to try multiple vectorizer hyperparameters and see which ones improve the score the most. Let's do that.

In [39]:
df['combined_text'] = df['combined_text'].str.lower()
df['post_title'] = df['post_title'].str.lower()

In [40]:
#Setting X and y variables for my models. I want to try using my 'combined_text' column for scoring
#but I'd also like to check how 'post_title' scores with models, so let's set two different X variables
#to test with.
X1 = df['combined_text']
X2 = df['post_title']
y = df['subreddit']

In [41]:
#Let's set up two train test splits here
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y, stratify=y, random_state=42)

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, stratify=y, random_state=42)

In [42]:
#Setting up my pipelines.

#Pipe1 will be using a count vectorizer to tokenize and logistic regression as its classification model.
pipe1 = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression())
])

#Pipe2 will be using a Tf-Idf vectorizer to tokenize and logistic regression as its classification model
pipe2 = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('lr', LogisticRegression())
])

#Pipe 3 will be using a count vectorizer to tokenize and Multinomial Naive Bayes as its classification model
pipe3 = Pipeline([
    ('cvec', CountVectorizer()),
    ('mn', MultinomialNB())
])

#Pipe 4 will be using a Tf-Idf vectorizer to tokenize and Multinomial Naive Bayes as its classification model
pipe4 = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('mn', MultinomialNB())
])

In [43]:
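# Setting up Vectorizer Hyperparameters. My gridsearch will go through all of these and give me the
# best score it can achieve.
# NOTE: token_pattern "[^a-z]" matches single characters that are NOT lowercase letters, so the
# tokens produced here are spaces, digits, and punctuation rather than words. A letters-only
# pattern would be "[a-z]+"; the results below were produced with the pattern as written.

cvec_pipe_params = {
    'cvec__stop_words': [None, 'english'],
    'cvec__max_features': [1000, 3000, 5000, 6000],
    'cvec__ngram_range': [(1,1), (1,2), (1,3)],
    'cvec__token_pattern': ["[^a-z]"],
}


tvec_pipe_params = {
    'tvec__stop_words': [None, 'english'],
    'tvec__max_features': [1000, 3000, 5000, 6000],
    'tvec__ngram_range': [(1,1), (1,2), (1,3)],
    'tvec__token_pattern': ["[^a-z]"],
}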
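
It is worth seeing what the `token_pattern` in these grids actually extracts. scikit-learn's vectorizers tokenize by collecting every match of the pattern, which is the same thing `re.findall` does, so a short sketch with the standard library is enough. The sample title is adapted from the data preview above, and the letters-only pattern is an assumption about the intent.

```python
import re

text = "arctic ice free by 2013!!"

# The pattern used in the grids: one character that is NOT a lowercase letter.
non_letter_tokens = re.findall(r"[^a-z]", text)

# A letters-only alternative (an assumption about what was intended).
word_tokens = re.findall(r"[a-z]+", text)

print(non_letter_tokens)
print(word_tokens)
```

With the first pattern, every word is thrown away and only spaces, digits, and punctuation are left for the model to learn from.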


In [44]:
#Setting up our Gridsearch variables with each of our pipelines
gs1 = GridSearchCV(pipe1, param_grid=cvec_pipe_params, cv=3)
gs2 = GridSearchCV(pipe2, param_grid=tvec_pipe_params, cv=3)
gs3 = GridSearchCV(pipe3, param_grid=cvec_pipe_params, cv=3)
gs4 = GridSearchCV(pipe4, param_grid=tvec_pipe_params, cv=3)
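
Before fitting, it helps to know how much work each search is. The sketch below counts the parameter combinations in one of the grids above and multiplies by the number of folds; it uses `itertools.product` as a stand-in for how the grid is expanded, and the variable names are illustrative.

```python
from itertools import product

# Same grid as cvec_pipe_params above.
cvec_pipe_params = {
    'cvec__stop_words': [None, 'english'],
    'cvec__max_features': [1000, 3000, 5000, 6000],
    'cvec__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'cvec__token_pattern': ["[^a-z]"],
}
n_folds = 3  # cv=3 in the GridSearchCV calls

# Every combination of one value per parameter is one candidate model.
candidates = list(product(*cvec_pipe_params.values()))
n_candidates = len(candidates)

# Each candidate is fit once per fold.
n_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits per search")
```

With the default `refit=True`, each search also does one more fit of the best candidate on the full training set, and there are four searches in total.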

## Fitting and Scoring with our Gridsearch Variables

In [45]:
#Alright time to kill my laptop. Let's fit and score these models. With all this we'll be able to see 
#What features worked best for scoring.
gs1.fit(X1_train, y1_train)
gs2.fit(X1_train, y1_train)
gs3.fit(X1_train, y1_train)
gs4.fit(X1_train, y1_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('tvec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...True,
        vocabulary=None)), ('mn', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'tvec__stop_words': [None, 'english'], 'tvec__max_features': [1000, 3000, 5000, 6000], 'tvec__ngram_range': [(1, 1), (1, 2), (1, 3)], 'tvec__token_pattern': ['[^a-z]']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [46]:
#Let's look at our best scores:
print(f" Count Vectorizer LR train score is {(gs1.score(X1_train, y1_train))}.")
print(f" Count Vectorizer LR test score is {(gs1.score(X1_test, y1_test))}.")
print(f" Tf-Idf LR train score is {(gs2.score(X1_train, y1_train))}.")
print(f" Tf-Idf LR test score is {(gs2.score(X1_test, y1_test))}.")
print(f" Count Vectorizer Multinomial NB train score is {(gs3.score(X1_train, y1_train))}.")
print(f" Count Vectorizer Multinomial NB test score is {(gs3.score(X1_test, y1_test))}.")
print(f" Tf-Idf Multinomial NB train score is {(gs4.score(X1_train, y1_train))}.")
print(f" Tf-Idf Multinomial NB test score is {(gs4.score(X1_test, y1_test))}.")

 Count Vectorizer LR train score is 0.7393185505678745.
 Count Vectorizer LR test score is 0.7244732576985413.
 Tf-Idf LR train score is 0.6814494321254733.
 Tf-Idf LR test score is 0.640194489465154.
 Count Vectorizer Multinomial NB train score is 0.6560302866414278.
 Count Vectorizer Multinomial NB test score is 0.5980551053484603.
 Tf-Idf Multinomial NB train score is 0.6473769605191996.
 Tf-Idf Multinomial NB test score is 0.6045380875202593.


In [47]:
#And our best params...
print(f"\nCount Vectorizer LR model's best params are \n{gs1.best_params_}")
print(f"\n Tf-Idf LR model's best params are \n{gs2.best_params_}")
print(f"\nCount Vectorizer Multinomial NB model's best params are \n{gs3.best_params_}")
print(f"\n Tf-Idf Multinomial NB model's best params are \n{gs4.best_params_}")


Count Vectorizer LR model's best params are 
{'cvec__max_features': 3000, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': None, 'cvec__token_pattern': '[^a-z]'}

 Tf-Idf LR model's best params are 
{'tvec__max_features': 5000, 'tvec__ngram_range': (1, 3), 'tvec__stop_words': None, 'tvec__token_pattern': '[^a-z]'}

Count Vectorizer Multinomial NB model's best params are 
{'cvec__max_features': 6000, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': None, 'cvec__token_pattern': '[^a-z]'}

 Tf-Idf Multinomial NB model's best params are 
{'tvec__max_features': 3000, 'tvec__ngram_range': (1, 3), 'tvec__stop_words': None, 'tvec__token_pattern': '[^a-z]'}


| Model                  | Vectorizer   | Set   | Score |
|------------------------|--------------|-------|-------|
| Model 1 LR             | Custom Count | Train | .9816 |
| Model 1 LR             | Custom Count | Test  | .8152 |
| Model 2 LR             | Count        | Train | .7393 |
| Model 2 LR             | Count        | Test  | .7244 |
| Model 3 LR             | Tf-Idf       | Train | .6814 |
| Model 3 LR             | Tf-Idf       | Test  | .6401 |
| Model 4 Multinomial NB | Count        | Train | .6560 |
| Model 4 Multinomial NB | Count        | Test  | .5980 |
| Model 5 Multinomial NB | Tf-Idf       | Train | .6473 |
| Model 5 Multinomial NB | Tf-Idf       | Test  | .6045 |

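***Ok so these scores were actually worse than my first model. Huh. One likely culprit is the `token_pattern` of `[^a-z]`: it matches single non-letter characters, so these pipelines were vectorizing spaces, digits, and punctuation rather than words. Some other things to note: the best ngram_range was always (1, 3) and the optimal max_features was different for each pipeline.***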

### Model Attempt 3: Using Just the 'post_title' Column

In [48]:
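#Note: .fit() returns the estimator itself, so these names point at the same gs1-gs4 objects from
#above, which are refit here on the titles (their combined_text fits are overwritten).
gs1_post_title_only = gs1.fit(X2_train, y2_train)
gs2_post_title_only = gs2.fit(X2_train, y2_train)
gs3_post_title_only = gs3.fit(X2_train, y2_train)
gs4_post_title_only = gs4.fit(X2_train, y2_train)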
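
A small aside on the assignments in this cell: in scikit-learn, `.fit()` returns the estimator itself, so assigning the result to a new name does not create a second model. The sketch below shows this on made-up toy data with a plain `LogisticRegression`, and shows `sklearn.base.clone` as one way to get an independent, unfitted copy. The data and variable names are assumptions for illustration.

```python
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

# Two toy datasets with opposite relationships between x and the label.
X_a, y_a = [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]
X_b, y_b = [[0.0], [1.0], [2.0], [3.0]], [1, 1, 0, 0]

model = LogisticRegression()
fitted_a = model.fit(X_a, y_a)
fitted_b = model.fit(X_b, y_b)   # refits the SAME object; the first fit is gone

# Both names (and `model`) point at one estimator, now fit on dataset b.
print(fitted_a is fitted_b)

# clone() returns a fresh, unfitted estimator with the same parameters,
# so this fit does not touch `model`.
independent = clone(model).fit(X_a, y_a)
print(independent is model)
```

The same pattern applies to the `GridSearchCV` objects here: cloning them before the second round of fits would keep the combined_text searches available for later comparison.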

In [49]:
print(f" Count Vectorizer LR train score is {(gs1_post_title_only.score(X2_train, y2_train))}.")
print(f" Count Vectorizer LR test score is {(gs1_post_title_only.score(X2_test, y2_test))}.")
print(f" Tf-Idf LR train score is {(gs2_post_title_only.score(X2_train, y2_train))}.")
print(f" Tf-Idf LR test score is {(gs2_post_title_only.score(X2_test, y2_test))}.")
print(f" Count Vectorizer Multinomial NB train score is {(gs3_post_title_only.score(X2_train, y2_train))}.")
print(f" Count Vectorizer Multinomial NB test score is {(gs3_post_title_only.score(X2_test, y2_test))}.")
print(f" Tf-Idf Multinomial NB train score is {(gs4_post_title_only.score(X2_train, y2_train))}.")
print(f" Tf-Idf Multinomial NB test score is {(gs4_post_title_only.score(X2_test, y2_test))}.")

 Count Vectorizer LR train score is 0.7306652244456463.
 Count Vectorizer LR test score is 0.6304700162074555.
 Tf-Idf LR train score is 0.6533261222282315.
 Tf-Idf LR test score is 0.5721231766612642.
 Count Vectorizer Multinomial NB train score is 0.6619794483504597.
 Count Vectorizer Multinomial NB test score is 0.5834683954619124.
 Tf-Idf Multinomial NB train score is 0.6630611141157382.
 Tf-Idf Multinomial NB test score is 0.5883306320907618.


In [50]:
print(f"\nCount Vectorizer LR model's best params are \n{gs1_post_title_only.best_params_}")
print(f"\n Tf-Idf LR model's best params are \n{gs2_post_title_only.best_params_}")
print(f"\nCount Vectorizer Multinomial NB model's best params are \n{gs3_post_title_only.best_params_}")
print(f"\n Tf-Idf Multinomial NB model's best params are \n{gs4_post_title_only.best_params_}")


Count Vectorizer LR model's best params are 
{'cvec__max_features': 3000, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': None, 'cvec__token_pattern': '[^a-z]'}

 Tf-Idf LR model's best params are 
{'tvec__max_features': 1000, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': None, 'tvec__token_pattern': '[^a-z]'}

Count Vectorizer Multinomial NB model's best params are 
{'cvec__max_features': 3000, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': None, 'cvec__token_pattern': '[^a-z]'}

 Tf-Idf Multinomial NB model's best params are 
{'tvec__max_features': 3000, 'tvec__ngram_range': (1, 3), 'tvec__stop_words': None, 'tvec__token_pattern': '[^a-z]'}


***Looks like these are my worst scores yet. I figured they would be worse but I thought it would probably be best practice to check.***

## Analysis
- My best model was actually the one I didn't tune hyperparameters on. I think I know why this happened though: in that model I used an edited stop list where I added some key words that were widely used in both subreddits, like 'climate' and 'change'. Because those words were used so much, removing them likely allowed the model to pick up more impactful features. I also lemmatized my strings in model one, which also seems to have helped the model's score.
- A few more key takeaways: Hyperparameters matter. Despite my pipeline-gridsearch models being outperformed by my first model, they did have one advantage over it: they suffered from far less variance, with a much smaller gap between train and test scores. In the future I'd like to figure out some ways to add lemmatization and more customized stopword sets as gridsearch hyperparameters. That would potentially allow for higher scoring models with low variance.
- If I could keep working on this in the future, I'd want to collect more data. With more text I believe I could further increase the model's accuracy. It would also be immensely helpful if I had more post text to work with rather than just titles. Alternatively, if I could set up an image recognition neural network, I think that could be another interesting way to use this data in the future (a lot of the posts gathered had images instead of text).
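
On the idea of making lemmatization and custom stopword sets into gridsearch hyperparameters: one possible approach is to pass them through the vectorizer's `tokenizer` and `stop_words` parameters. The sketch below is only a rough illustration under several assumptions: `toy_lemma_tokenizer` is a hypothetical stand-in for a real lemmatizing tokenizer (it just strips a trailing "s"), and the eight documents, their labels, and the custom stop list are made up.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def toy_lemma_tokenizer(text):
    """Hypothetical stand-in for a lemmatizing tokenizer: strips a trailing 's'."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]


# Custom stop list (illustrative), searched alongside None and sklearn's built-in list.
custom_stops = ["climate", "change", "the", "is", "a"]

pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("lr", LogisticRegression()),
])

params = {
    "cvec__stop_words": [None, "english", custom_stops],
    "cvec__tokenizer": [None, toy_lemma_tokenizer],
}

# Made-up documents: 1 = climatechange-style, 0 = climateskeptics-style.
docs = [
    "climate change is melting glaciers",
    "the glaciers are melting fast",
    "warming oceans threaten reefs",
    "reefs bleach as oceans warm",
    "climate change is a hoax",
    "the hoax of warming alarmists",
    "alarmists exaggerate the models",
    "models are wrong about change",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

gs = GridSearchCV(pipe, param_grid=params, cv=2)
gs.fit(docs, labels)

print(gs.best_params_)
```

scikit-learn may warn that `token_pattern` is ignored when a custom tokenizer is supplied, or that the stop words are inconsistent with the tokenizer; those are warnings rather than errors. With real data, the toy tokenizer could be swapped for one built on `WordNetLemmatizer`.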
