## Pre-processing and Training Data Development

The goal of the preprocessing work is to prepare your data for fitting models. If you
identified any categorical features in your dataset in the EDA step, now is the time to
create dummy features so that those features can be included in your model
development. Standardizing the magnitude of the numeric features and creating
train and test data subsets also happen in this step. You may want to save a version of your
clean, preprocessed data frame as a CSV to access later.
If you need a refresher about how to complete this work, review the work you did during
the guided capstone and revisit the DSM Medium article.
The following steps should be completed in a Jupyter Notebook.
Goal: Create a cleaned development dataset you can use to complete the
modeling step of your project.
Steps:
● Create dummy or indicator features for categorical variables
● Standardize the magnitude of numeric features using a scaler
● Split into testing and training datasets
Review the following questions and apply them to your dataset:
● Does my dataset have any categorical data, such as gender or day of the week?
● Do my features span different ranges, such as 0-100 for some and 0-1 for others?
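The first two steps listed above can be sketched on a small toy frame. This is only an illustration: the `toy` DataFrame and its column names are hypothetical and are not part of the sentiment dataset used below. It assumes `pd.get_dummies` for the indicator features and scikit-learn's `StandardScaler` for the scaling.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the column names are illustrative only
toy = pd.DataFrame({
    "day_of_week": ["Mon", "Tue", "Mon", "Wed"],
    "likes": [10.0, 250.0, 35.0, 80.0],
    "score": [0.1, 0.9, 0.4, 0.6],
})

# Create one indicator column per level of the categorical feature
encoded = pd.get_dummies(toy, columns=["day_of_week"])

# Rescale the numeric columns to zero mean and unit variance
numeric_cols = ["likes", "score"]
scaler = StandardScaler()
encoded[numeric_cols] = scaler.fit_transform(encoded[numeric_cols])
```

In a real project the scaler would normally be fit on the training subset only and then applied to the test subset, so that no information from the test data leaks into preprocessing.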


In [2]:
# import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split

In [3]:
# load cleaned dataset
df = pd.read_csv('sentimentdataset_cleaned.csv', index_col = False)

Considerations for the 

Feature selection: Because we are only evaluating the sentiment of a social media post, the only input we need is the text field, which combines the raw text and hashtags after tokenization and removal of punctuation, emojis, and stand-alone digits. Other fields, such as likes, date, and country, can be ignored when evaluating a predictive model on the processed text alone.

Scaling: Since the only input is the text field and no numeric features are used, feature scaling may not be necessary.

Outliers: Since the text is an aggregation of social media posts, outliers shouldn't have a significant effect on the models, as they might if the domain were narrower or more technical (for example, a physics research paper appearing in an analysis of medical papers).

Imbalanced datasets: In our case, we know that the three sentiment categories are imbalanced.
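One way to quantify an imbalance like this, and to carry it through a split, is sketched below. The labels and counts are made up for illustration and are not taken from the sentiment dataset. The `stratify` argument of `train_test_split` is shown as an option, not as the approach this notebook uses.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in labels; the class counts are invented for illustration
labels = pd.Series(
    ["positive"] * 24 + ["neutral"] * 12 + ["negative"] * 4, name="sentiment"
)
texts = pd.Series([f"post {i}" for i in range(len(labels))], name="text")

# Share of each class in the full dataset
print(labels.value_counts(normalize=True))

# stratify=labels keeps the class proportions the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=123
)
```

Without stratification, a rare class can end up under-represented in the test subset purely by chance, which makes per-class metrics less reliable.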


In [4]:
# store the text as the feature X and each sentiment label (VADER, TextBlob) as a target y
X = df['Text_Combined']
y_v = df['Sentiment_VADER']
y_tb = df['Sentiment_TextBlob']

In [5]:
# split features and target into training and test sets
# (20% held out for the VADER labels, 33% for the TextBlob labels)
X_vtrain, X_vtest, y_vtrain, y_vtest = train_test_split(X, y_v, test_size=0.2, random_state=123)
X_tbtrain, X_tbtest, y_tbtrain, y_tbtest = train_test_split(X, y_tb, test_size=0.33, random_state=123)

In [6]:
# export training and test data
X_vtrain.to_csv('X_vtrain.csv', index=False)
X_tbtrain.to_csv('X_tbtrain.csv', index=False)
y_vtrain.to_csv('y_vtrain.csv', index=False)
y_tbtrain.to_csv('y_tbtrain.csv', index=False)
X_vtest.to_csv('X_vtest.csv', index=False)
X_tbtest.to_csv('X_tbtest.csv', index=False)
y_vtest.to_csv('y_vtest.csv', index=False)
y_tbtest.to_csv('y_tbtest.csv', index=False)
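When the exported files are loaded again in the modeling step, each one comes back as a one-column DataFrame instead of a Series. The sketch below shows one way to round-trip a split. `X_demo` and the temporary file are hypothetical stand-ins, so the real exports are not touched.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for one exported split
X_demo = pd.Series(
    ["great day", "bad service", "okay i guess"], name="Text_Combined"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "X_demo.csv")
    X_demo.to_csv(path, index=False)
    # read_csv returns a one-column DataFrame; squeeze it back into a Series
    X_loaded = pd.read_csv(path).squeeze("columns")
```

Because the files are written with `index=False`, the original row index is dropped, so features and targets line up by row position only after reloading.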