<a href="https://colab.research.google.com/github/kmsekgothe/load-shortfall-regression-predict-api/blob/master/KaggleNotebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Team 12 - Advanced Classification Predict

© Explore Data Science Academy

---

### Introduction: 
---

### Predict Overview




<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data</a>

<a href=#three>3. Data Preprocessing</a>

<a href=#four>4. Exploratory Data Analysis (EDA)</a>

<a href=#five>5. Data Engineering</a>

<a href=#six>6. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages

In this section we import the libraries needed for Data Analysis, Data Manipulation, Data Visualization and Model Building.

In [1]:
# libraries needed for Data Analysis and  Data Manipulation
import numpy as np # used to evaluate arrays
import pandas as pd # used to create and utilise tabular data ie Pandas DataFrame

# libraries to be used for Data Visualization
import matplotlib.pyplot as plt # used to visualize data
import seaborn as sns # used to visualize data
from matplotlib import rc
%matplotlib inline

# Libraries for data preparation and model building
import sklearn
import re
import string
import requests
from time import sleep
from nltk.corpus import stopwords
from nltk.tokenize import TreebankWordTokenizer
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud

from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import roc_auc_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# Suppress cell warnings
import warnings
warnings.filterwarnings("ignore")

In [None]:
# !pip install wordcloud

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

- Loading of Test and Train datasets. 
- Concatenate the datasets to ensure Data Engineering is done only once (for convenience). 
- Dataframes will then be split later on when needed.

In [2]:
df = pd.read_csv('train.csv') # load the data
df_test = pd.read_csv('test_with_no_labels.csv')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

Perform basic analysis on the dataframe.

In [None]:
# Basic Train Analysis

df.shape # train DataFrame has 15 819 rows and 3 columns

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df['sentiment'].unique()

In [None]:
df['sentiment'].value_counts().plot(kind = 'bar')
plt.title('Class Frequency')
plt.xlabel('Class')
plt.ylabel('Frequency')
plt.show()

In [None]:
unique, counts = np.unique(df['sentiment'], return_counts=True)
unique_counts_dict = {
    'Unique Count': {
        "Class -1": counts[0],
        "Class 0": counts[1],
        "Class 1": counts[2],
        "Class 2": counts[3],
    }
}
unique_count = pd.DataFrame(data=unique_counts_dict)
unique_count.sort_values(by='Unique Count', ascending=False)

Class Description:

- 2 News: the tweet links to factual news about climate change
- 1 Pro: the tweet supports the belief of man-made climate change
- 0 Neutral: the tweet neither supports nor refutes the belief of man-made climate change
- -1 Anti: the tweet does not believe in man-made climate change

Note the imbalance here: there are over 8000 observations in class 1 and only 1296 observations in class -1.
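One common way to account for this kind of imbalance is to weight each class inversely to its frequency. The sketch below is a pure-Python illustration of the heuristic `n_samples / (n_classes * class_count)`. The `labels` list is made-up toy data rather than the real class counts, and `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy labels (illustrative only): class 1 dominates, class -1 is rare
labels = [1] * 6 + [2] * 3 + [0] * 2 + [-1] * 1
weights = balanced_class_weights(labels)
for cls in sorted(weights):
    print(cls, weights[cls])
```

Rare classes receive the largest weights. scikit-learn's `LogisticRegression(class_weight='balanced')` applies the same heuristic, so the weights rarely need to be computed by hand.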

In [None]:
# Basic Test Analysis

df_test.shape # test DataFrame has 10 546 rows and 2 columns

In [None]:
df_test.head()

In [None]:
df_test.info()

<a id="three"></a>
## 3. Data Preprocessing
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

#### Data Cleaning and Formatting
 

Before we can do **Exploratory Data Analysis** (EDA) in section 4, we need to ensure that our data is in a usable format.

In [None]:
df.head()

In [None]:
# Better view of what's in the dataset
for i, row in df.iterrows():
    print(i)
    print(row)
    print("\n")

In [None]:
df.head()

In [None]:
# Create a new column holding the URLs embedded in the message column

def extract_urls(text):
    # The parameter is named `text` so it does not shadow the imported `string` module
    url_pattern = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    urls = re.findall(url_pattern, text)
    # findall returns one tuple per match; the first group is the full URL
    return str([match[0] for match in urls])

df['url'] = df['message'].apply(extract_urls)

In [None]:
url_pattern = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
df['message'] = df['message'].replace(to_replace = url_pattern, value = r' ', regex = True)
df

In [None]:
#why is there a url in message column?!
df['message'][15814]

In [None]:
df['url'] = df['url'].astype(str).str[1:-1]

In [None]:
df['url'] = df['url'].str.replace("'", "")

In [None]:
df

In [None]:
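# Extract sentiment information from urls, i.e. web page titles.

def extract_web_title(url):
    """Return the text of the page's <title> tag, or None if it cannot be read."""
    if not url:
        return None
    # The browser string must be sent under the 'User-Agent' header name
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}
    try:
        response = requests.get(url, headers=headers, timeout=10)  # Sends a GET request
    except requests.RequestException:
        return None
    match = re.search(r'<title[^>]*>(.*?)</title>', response.text, flags=re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None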
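Title extraction can also be done with a real HTML parser, which copes with attributes on the tag and with pages that have no title at all. The following is a minimal sketch using the standard library's `html.parser`. It works on an HTML string, so no network request is made; `TitleParser`, `title_from_html` and `sample_html` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        text = ''.join(self._parts).strip()
        return text or None

def title_from_html(html_text):
    """Return the page title, or None when the page has no title."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title

# Made-up page source (illustrative only)
sample_html = '<html><head><title lang="en"> Climate report 2016 </title></head><body>text</body></html>'
print(title_from_html(sample_html))
print(title_from_html('<html><body>no title here</body></html>'))
```

In the notebook, `title_from_html` would be applied to the response text returned by `requests.get`.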

In [None]:
extract_web_title('https://t.co/yeLvcEFXkC')

In [None]:
# Replace each URL with its web page title (one HTTP request per row, so this is slow)
df['url'] = df['url'].apply(extract_web_title)

In [None]:
extract_web_title("https://t.co/yeLvcEFXkC")

#### Clean data

In [None]:
# Remove special characters

def clean_data(tweet):
    pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
    tweet = re.sub(pattern_url, '', tweet) 
    tweet = re.sub(r'@[A-Za-z0-9]+', '', tweet)
    tweet = re.sub(r'#', '', tweet)
    tweet = re.sub(r'RT[\s]+', '', tweet)
    return tweet

In [None]:
df['clean_tweet'] = df['message'].apply(clean_data)

In [None]:
# Remove punctuation

def remove_punctuation(tweet):
    return ''.join([l for l in tweet if l not in string.punctuation])

In [None]:
df['clean_tweet'] = df['clean_tweet'].apply(remove_punctuation)

In [None]:
# Make all the text lower case to remove some noise from capitalisation

def remove_cap(tweet):
    return tweet.lower()

df['clean_tweet'] = df['clean_tweet'].apply(remove_cap)
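To see what the cleaning steps do when chained together, here is a small self-contained sketch applied to a made-up tweet. It is not the notebook's exact code: the URL pattern is simplified, the mention pattern also allows underscores, and `clean_tweet` is a hypothetical name that bundles the steps into one function.

```python
import re
import string

URL_PATTERN = r'https?://\S+'  # simplified URL pattern (assumption)

def clean_tweet(tweet):
    tweet = re.sub(URL_PATTERN, '', tweet)        # drop links
    tweet = re.sub(r'@[A-Za-z0-9_]+', '', tweet)  # drop @mentions
    tweet = re.sub(r'#', '', tweet)               # keep hashtag text, drop the '#'
    tweet = re.sub(r'\bRT\s+', '', tweet)         # drop the retweet marker
    tweet = ''.join(ch for ch in tweet if ch not in string.punctuation)
    return ' '.join(tweet.lower().split())        # lower-case and collapse whitespace

# Made-up example tweet (illustrative only)
sample = 'RT @EcoWatch: Climate change is real! #ActOnClimate https://t.co/abc123'
print(clean_tweet(sample))  # climate change is real actonclimate
```

Removing the URL before stripping punctuation matters: if punctuation were removed first, the link would collapse into a single meaningless token instead of being dropped.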

In [None]:
tokeniser = TreebankWordTokenizer()
df['clean_tweet'] = df['clean_tweet'].apply(tokeniser.tokenize)

In [None]:
df.head()

In [None]:
# Lemmatize the words in the dataframe

def lemma(words, lemmatizer):
    return [lemmatizer.lemmatize(word) for word in words]   

In [None]:
lemmatizer = WordNetLemmatizer()
df['clean_tweet'] = df['clean_tweet'].apply(lemma, args=(lemmatizer, ))

In [None]:
df.head()

In [None]:
# Incomplete: `.map` was never called with a function, so this line is disabled
# df['clean_tweet'] = df['clean_tweet'].map

In [None]:
# Remove stopwords

#def remove_stop_words(tokens):    
#    return [t for t in tokens if t not in stopwords.words('english')]

#df['clean_tweet'] = df['clean_tweet'].apply(remove_stop_words)
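If stop-word removal is re-enabled, it is cheaper to build the stop-word collection once as a `set` than to call `stopwords.words('english')` for every token. The sketch below uses a tiny hand-written stop-word set (an assumption) so that it runs without downloading the NLTK corpus.

```python
# Tiny illustrative stop-word set (assumption); in the notebook this would be
# built once with: STOP_WORDS = set(stopwords.words('english'))
STOP_WORDS = {'the', 'is', 'a', 'an', 'of', 'to', 'and', 'in'}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

# Made-up token list (illustrative only)
tokens = ['the', 'climate', 'is', 'changing', 'in', 'the', 'arctic']
print(remove_stop_words(tokens))  # ['climate', 'changing', 'arctic']
```

Membership tests on a set take constant time on average, whereas rebuilding and scanning a list for every token repeats the same work thousands of times.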

#### Data Cleaning and Formatting Summary

- 

<a id="four"></a>
## 4. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

Exploratory Data Analysis is the process of performing initial investigations on data to discover patterns, spot anomalies and check assumptions with the help of summary statistics and graphical representations.
The following section analyses and provides an overview of the given data.

In [None]:
# Visualize the frequent words
#all_words = " ".join([sentence for sentence in df['clean_tweet']])
#df['clean_tweet'] = ''.join(map(str, df['clean_tweet']))
#wordcloud = WordCloud(width=800, height=500, random_state=42, max_font_size=100).generate(df['clean_tweet'])
df
# plot the graph
#plt.figure(figsize=(15,8))
#plt.imshow(wordcloud, interpolation='bilinear')
#plt.axis('off')
#plt.show()
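A plain frequency table of the most common words per class is a useful companion to a word cloud. The following is a minimal sketch using `collections.Counter`. The `rows` data is made up for illustration, and `top_words_by_class` is a hypothetical helper that expects `(sentiment, tokens)` pairs such as those in the `sentiment` and `clean_tweet` columns.

```python
from collections import Counter

# Toy tokenised tweets with sentiment labels (illustrative only)
rows = [
    (1, ['climate', 'change', 'is', 'real']),
    (1, ['act', 'on', 'climate', 'now']),
    (-1, ['climate', 'hoax']),
    (2, ['new', 'climate', 'report', 'released']),
]

def top_words_by_class(rows, n=2):
    """Return the n most frequent tokens for each sentiment class."""
    counters = {}
    for label, tokens in rows:
        counters.setdefault(label, Counter()).update(tokens)
    return {label: counter.most_common(n) for label, counter in counters.items()}

top = top_words_by_class(rows)
for label in sorted(top):
    print(label, top[label])
```

Words that dominate every class, such as "climate" in this toy data, carry little class information, which is one motivation for the `max_df` cut-off in a vectoriser.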

#### Key Insights

- 


<a id="five"></a>
## 5. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

Create the target and feature datasets, then separate the test data from the train data.

In [None]:
# feature extraction

vector = CountVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words=None, ngram_range=(1, 3)) #max_df=0?
X = vector.fit_transform(df['message'])
y = df['sentiment']
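The `ngram_range=(1, 3)` setting means that single words, word pairs and word triples are all counted as features. The sketch below shows in plain Python which n-grams that produces for one short sentence. `word_ngrams` is a hypothetical helper and does not reproduce `CountVectorizer`'s own tokenisation rules.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

# Made-up sentence (illustrative only)
tokens = 'climate change is real'.split()
for gram in word_ngrams(tokens):
    print(gram)
```

Four words already yield nine n-grams, which is why `max_features` is used to keep only the most frequent ones.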

In [None]:
# create targets and features dataset
#y =  df['sentiment']
#X = df.drop('sentiment', axis=1)

X = our features, or independent variables (IVs). These will be used to predict our dependent variable.

y = our target, or dependent variable (DV). This is the variable we want to predict.

In [None]:
# split the train data further into train/test data (to perform validation before bringing in the true unseen test data)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50, shuffle=False)
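With `shuffle=False`, the split takes the last 20% of the rows as the validation set and `random_state` has no effect. Because the classes are imbalanced, a stratified split that keeps the class proportions in both parts may be preferable. In scikit-learn this is available as `train_test_split(X, y, test_size=0.2, random_state=50, stratify=y)`. The pure-Python sketch below only illustrates the idea on made-up labels; `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=50):
    """Return (train_idx, test_idx) with roughly the same class mix in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_size))  # at least one per class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (illustrative only)
labels = [1] * 10 + [-1] * 5 + [0] * 5
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in test_idx))
```

Every class appears in the validation part in proportion to its size, so the per-class metrics are not computed on a handful of accidental examples.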

### Feature Scaling

In [None]:
#scaler = preprocessing.MinMaxScaler()
#X_scaled = scaler.fit_transform(X)

<a id="six"></a>
## 6. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

This section takes us through the machine learning process. We train and test a number of classification models and later select the model with the best performance to be used in this project. For each model, we compare the classification metrics (precision, recall and F1-score) as well as the time taken to train and test it. This will inform our model selection decision.

#### Model 1 - Logistic Regression

In [None]:
# Training
model = LogisticRegression()
model.fit(X_train, y_train)

#### Model 2

#### Model 3

#### Model 4 

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

In [None]:
pred = model.predict(X_test)

In [None]:
print('Classification Report')
print(classification_report(y_test, pred))
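The figures in a classification report can be reproduced from raw counts of true positives, false positives and false negatives. The sketch below does this in plain Python for made-up labels and predictions; `per_class_scores` is a hypothetical helper, not a scikit-learn function.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed from raw counts."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores

# Toy labels and predictions (illustrative only)
y_true = [1, 1, 1, 0, 0, -1]
y_pred = [1, 1, 0, 0, 1, -1]
scores = per_class_scores(y_true, y_pred)
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)
print(scores)
print(round(macro_f1, 4))
```

The macro-averaged F1 gives every class the same weight regardless of its size, which makes it a stricter summary than accuracy when the classes are imbalanced.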

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>



##### XGBoost Regression

This section discusses the inner workings of the best performing model in a simple way.

## Conclusion

Given two datasets, train and test data, we were tasked with following the data science process. 

We first set out to understand the data and its space. In other words, we set out to understand the electricity shortfall in Spain, the various variables that may or may not be correlated with the load shortfall in Spain, and the reasons behind the correlations or lack thereof. We were presented with a lot of information, and our first task was making sure we had a good understanding of the relevant things coming from the data.

In our data, we saw null values we had to impute, we encountered unusable data types we needed to transform, and data falling in large ranges which we had to scale. All of this formed part of the iterative process of Data Preprocessing, Exploratory Data Analysis and Data Engineering.

Essentially, we set out to understand and transform the data so that we may build an appropriate model that would best predict the load shortfall. 

We then built a few models and selected the right model out of these, following an iterative train-validation process. The main model performance metric was the Root Mean Squared Error (RMSE), and we looked for the model that produced the lowest RMSE value.

At this point, we have addressed the problem statement. In future, when given various circumstances (predictor observations), we are now able to predict the corresponding load shortfall with an average error (RMSE) of approximately 4300 (same units as the predicted variable, i.e. load shortfall).

#### Kaggle submission file
This section creates the Kaggle submission file in csv format.

In [None]:
df.head()

In [None]:
df_test.head()

In [None]:
x_train_final = df['message']
x_test_final = df_test['message']

In [None]:
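vector = CountVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words=None, ngram_range=(1, 3)) #max_df=0?
# Learn the vocabulary from the training messages only
x_train_final = vector.fit_transform(df['message'])
# Reuse that vocabulary for the test messages so the feature columns line up
x_test_final = vector.transform(df_test['message'])
y = df['sentiment']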
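A vectoriser's vocabulary has to be learned from the training text only and then reused on the test text, so that each column means the same word in both matrices. The sketch below illustrates this with a minimal bag-of-words class written for this example. `TinyCountVectorizer` is hypothetical and far simpler than scikit-learn's `CountVectorizer`, and the documents are made up.

```python
class TinyCountVectorizer:
    """Minimal bag-of-words vectoriser (illustration only, not scikit-learn's)."""

    def fit(self, docs):
        # Learn one column index per distinct word in the training documents
        vocab = sorted({word for doc in docs for word in doc.lower().split()})
        self.vocabulary_ = {word: i for i, word in enumerate(vocab)}
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for word in doc.lower().split():
                if word in self.vocabulary_:  # words unseen in training are ignored
                    row[self.vocabulary_[word]] += 1
            rows.append(row)
        return rows

# Made-up documents (illustrative only)
train_docs = ['climate change is real', 'climate hoax']
test_docs = ['climate report says change is real']

vec = TinyCountVectorizer().fit(train_docs)
X_train = vec.transform(train_docs)
X_test = vec.transform(test_docs)
print(vec.vocabulary_)
print(X_test)
```

Fitting a second time on the test documents would create a different vocabulary, so column 0 would no longer refer to the same word the model was trained on.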

In [None]:
model.fit(x_train_final, y)

In [None]:
predict_final = model.predict(x_test_final)

In [None]:
daf = pd.DataFrame(predict_final, columns=['sentiment'])
daf.head()

In [None]:
df_test_final = pd.read_csv('test_with_no_labels.csv')

In [None]:
output = pd.DataFrame({"TweetID":df_test_final['tweetid']})
submission = output.join(daf)        
submission.to_csv("submission.csv", index=False)

In [None]:
submission