<a href="https://colab.research.google.com/github/DavidMichaelH/machine-learning/blob/main/Kaggle/DisasterTweets/notebooks/NOTEBOOK_KAGGLE_DISASTER_TWEET.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Kaggle Disaster Tweets**

In this notebook, we aim to achieve high accuracy in the Kaggle Disaster Tweets competition by using feature engineering to improve the performance of a **logistic regression model**. We then compare the results with a more sophisticated model, a **Long Short-Term Memory (LSTM)** network, to see whether a higher score can be achieved.


You should expect to see at least **79%** validation accuracy for the **logistic regression model** and **>80%** for the **LSTM model** by running the notebook as is.

>[Kaggle Disaster Tweets](#scrollTo=fIstEV7Xwl5z)

>>[Problem statement and objectives](#scrollTo=ngF6RA5sNZK1)

>[Setting up Kaggle Environment and Downloading the Competition Data](#scrollTo=dwrUhBv-FV0H)

>[Exploratory Data Analysis](#scrollTo=MQ_aak1pFay2)

>[Text Feature Extraction](#scrollTo=pjl9cUy5H1F4)

>[Model Selection and Evaluation](#scrollTo=zElcrAG-Fl2o)

>[Evaluate the Performance of the Model](#scrollTo=tFv2TleuFowM)

>[Deep Learning Approach](#scrollTo=-q2aX63zjclO)

>[Word Embeddings](#scrollTo=aWtfYdkDS9KB)

>[Deep Learning Models](#scrollTo=7DC6KNFgoXl6)



## Problem statement and objectives

Problem statement:
In the event of a disaster, people often turn to social media to share information and updates. However, sorting through these tweets and identifying the relevant information can be a time-consuming and challenging task. The goal of this project is to develop a model that can accurately classify tweets as disaster-related or not, to help emergency responders quickly identify and respond to crisis situations.

Objectives:

*  Develop a model that can accurately classify tweets as disaster-related or not
*  Preprocess and clean the tweets to remove irrelevant information
*  Extract features from the tweets to improve model performance
*  Train and evaluate the model using a combination of traditional machine learning and deep learning techniques
*  Improve the model's performance by optimizing the parameters
*  Evaluate the model's performance using metrics such as accuracy, precision, recall, and F1 score.
 


In [None]:
import pandas as pd
import numpy as np
import xgboost as xgb
from tqdm import tqdm

# Sklearn imports
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold, KFold, train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import mutual_info_score
from sklearn.ensemble import VotingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# NLTK imports
import nltk
nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import FreqDist, word_tokenize

# Wordcloud import
from wordcloud import WordCloud, STOPWORDS

# Keras imports
import tensorflow as tf
from keras.utils import np_utils, pad_sequences

# Plotting library
import matplotlib.pyplot as plt
import seaborn as sns 


# String manipulation 
import string
import re


! pip install category_encoders
import category_encoders as ce

In [None]:
#defining a list of english stopwords from the nltk corpus 
stop_words = stopwords.words('english') 

# Setting up Kaggle Environment and Downloading the Competition Data

Install and upgrade kaggle library

In [None]:
!pip install -q kaggle
!pip install kaggle --upgrade

Upload the JSON file used to authenticate with Kaggle.

In [None]:
from google.colab import files
files.upload()

Create the directory where you will store the json file.

In [None]:
! mkdir -p ~/.kaggle

Copy the kaggle file into the new directory. This is where kaggle expects to find the file.

In [None]:
! cp kaggle.json ~/.kaggle/

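Restrict the file permissions so that only the owner can read and write the file. The Kaggle client warns when the credentials file is readable by other users.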

In [None]:
! chmod 600 ~/.kaggle/kaggle.json
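
To make the meaning of mode `600` concrete, here is a small sketch that applies the same permission bits to a throwaway file and inspects them. It assumes a POSIX system (Colab runs on Linux); the temporary file is only a hypothetical stand-in for `kaggle.json`.

```python
import os
import stat
import tempfile

# Create a throwaway file standing in for kaggle.json (hypothetical stand-in).
with tempfile.NamedTemporaryFile(delete=False) as handle:
    token_path = handle.name

# 0o600 = read and write for the owner, no access for group or others.
os.chmod(token_path, 0o600)
mode = stat.S_IMODE(os.stat(token_path).st_mode)
print(oct(mode))

owner_can_write = bool(mode & stat.S_IWUSR)
others_can_read = bool(mode & stat.S_IROTH)

os.remove(token_path)
```

The owner keeps read and write access while group and other users get none, which is the point of the `chmod 600` step.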

Download the competition data.

In [None]:
! kaggle competitions download -c nlp-getting-started

Unzip the files.

In [None]:
!unzip nlp-getting-started.zip

In [None]:
! mv train.csv labled_data.csv

In [None]:
! mv test.csv unlabled_data.csv

Use pandas to load the data into dataframes. The training file was renamed to `labled_data.csv` and the test file to `unlabled_data.csv` in the cells above.

In [None]:
labled = pd.read_csv("labled_data.csv")
unlabled = pd.read_csv("unlabled_data.csv")
sample = pd.read_csv("sample_submission.csv")

# Exploratory Data Analysis

Conduct an exploratory data analysis (EDA) to understand the distribution of the target variable, identify any missing values, and create visualizations for insight.

The first step is to view some of the labeled data examples.

In [None]:
labled.head(20)

This code cell randomly selects and prints a row from the labeled data set along with its corresponding text.

In [None]:
index = np.random.randint(0,len(labled))
print( labled.iloc[index])
print(labled.text[index])

We look at the distribution of the target variable to understand how many examples of each class we have. This helps gauge whether the dataset is balanced; each class should have a sufficiently representative sample.

In [None]:
sns.countplot(y=labled.target);
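
The same balance check can be expressed numerically. The sketch below uses a hypothetical list of toy labels (not the competition data) to compute class proportions and the accuracy of a majority-class baseline, which is a useful reference point for the model scores reported later.

```python
from collections import Counter

# Hypothetical labels: 1 = disaster, 0 = not a disaster.
toy_targets = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]

counts = Counter(toy_targets)
total = len(toy_targets)
proportions = {label: counts[label] / total for label in sorted(counts)}
print(proportions)

# Majority-class baseline: always predict the most common label.
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total
print(majority_label, baseline_accuracy)
```

Any classifier worth keeping should beat this baseline accuracy on held-out data.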

These code cells calculate and print the percentage of missing values in each column of both the labeled and unlabeled data sets.

In [None]:
print("Percentage of null entries in training set")
print("-"*30)
100 * labled.isnull().sum()/len(labled)

In [None]:
print("Percentage of null entries in test set")
print("-"*30)
100 * unlabled.isnull().sum()/len(unlabled)

**Exploring the Keyword Feature**

Compare the number of unique keywords in the training and test sets.

In [None]:
print(labled.keyword.nunique(), unlabled.keyword.nunique())

The keywords are an interesting feature in this dataset. So we begin by examining them closely.

We can obtain a sorted Series of keywords and their number of occurrences using the following pandas operations.

In [None]:
labled.keyword.value_counts().head(10)

This code cell displays the top 15 most frequently occurring keywords in the training set.

In [None]:
labled.keyword.value_counts().iloc[:15]

We can perhaps better illustrate their frequency by plotting the first 15.

These are the top 15 keywords across both disaster and non-disaster tweets.

In [None]:
def plot_top_values(series,plot_name,top_num=15):
  top_num = min(top_num,len(series))
  plt.figure(figsize=(9,6))
  sns.countplot(y=series, order = series.value_counts().iloc[:top_num].index)
  plt.title(plot_name)
  plt.show()

Display the top 15 keywords in the training set. 

In [None]:
kw = labled.keyword
plot_top_values(kw,'Top 15 keywords')

Separate disaster and non-disaster related tweets and display the top 15 keywords for each.

In [None]:
kw = labled[labled.target == 1].keyword;
plot_top_values(kw,'Top 15 Disaster keywords')

In [None]:
kw = labled[labled.target == 0].keyword;
plot_top_values(kw,'Top 15 Non-Disaster keywords')

We would like to know how much a keyword helps us decide whether a tweet corresponds to a disaster. One way to measure this is to average the target values for each keyword.

In [None]:
def display_features(df,top_num_to_disp,feature_name,min_num_examples = 1,target_name = 'target',ascending=False,plot_title = ''):
    """
    Plot the mean target value of the top N values of a feature as a bar chart.

    Parameters:
    df : DataFrame - The dataframe containing the data to be plotted
    top_num_to_disp : int - The number of top feature values to display
    feature_name : str - The name of the feature column to group by and display
    min_num_examples : int - The minimum number of examples for a feature value to be included
    target_name : str - The name of the target column whose mean is displayed
    ascending : bool - Order to sort the mean target values by (default is False, descending)
    plot_title : str - Title for the plot (default is empty)
    """
    raw_counts = df[feature_name].value_counts() # Raw counts of each feature value
    frequent_values = list(raw_counts[raw_counts>=min_num_examples].index) # Values with at least min_num_examples examples
    frequent_only = df[df[feature_name].isin(frequent_values)] # Keep only the rows with those values

    # Group by the feature and average the target; selecting the target column first
    # avoids trying to average non-numeric columns
    mean_target = frequent_only.groupby(feature_name)[target_name].mean().sort_values(ascending=ascending)

    plt.figure(figsize=(20,14)) # set the size of the plot
    # Create a bar plot of the top feature values to display
    top = mean_target.iloc[:top_num_to_disp]
    sns.barplot(x=top.index, y=top)
    # Draw a line at the overall mean target value
    plt.axhline(df[target_name].mean())
    plt.xticks(rotation=80)
    if len(plot_title) > 0:
        plt.title(plot_title)
    plt.show()
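
To make the computation behind this plot explicit, here is a dependency-free sketch of the same idea: group by a key, average the 0/1 target, and ignore keys with too few examples. The rows and the helper name `mean_target_by_key` are hypothetical and only illustrate the logic.

```python
from collections import defaultdict

# Hypothetical (keyword, target) pairs standing in for the labeled data.
rows = [
    ("wildfire", 1), ("wildfire", 1), ("wildfire", 0),
    ("ruin", 0), ("ruin", 0),
    ("derailment", 1),
]

def mean_target_by_key(pairs, min_num_examples=1):
    """Return {key: mean target} for keys seen at least min_num_examples times."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for key, target in pairs:
        totals[key] += target
        counts[key] += 1
    return {
        key: totals[key] / counts[key]
        for key in counts
        if counts[key] >= min_num_examples
    }

rates = mean_target_by_key(rows, min_num_examples=2)
ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

The `min_num_examples` filter matters: a keyword seen once would otherwise show a rate of exactly 0 or 1, which says little about how predictive it is.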

Display the top 30 keywords that are most likely to be disaster tweets

In [None]:
display_features(labled,30,'keyword',min_num_examples= 10,ascending=False,plot_title='Keywords with highest % of disaster tweets')

Display the top 30 keywords that are least likely to be disaster tweets

In [None]:
display_features(labled,30,'keyword',min_num_examples= 10,ascending=True,plot_title='Keywords with highest % of non-disaster tweets')

**Exploring the Location Feature in Disaster Tweets**

The goal of this analysis is to understand the distribution of unique locations in the labeled and unlabeled datasets, as well as the relationship between location and disaster tweets.

In [None]:
# Check the number of unique locations
print("Number of unique locations in labeled dataset:", labled.location.nunique())
print("Number of unique locations in unlabeled dataset:", unlabled.location.nunique())

To get a better understanding of the distribution of locations in the labeled dataset, let's examine the top 15 locations.

In [None]:
# Examine the most common locations among the labeled dataset
kw = labled.location
plot_top_values(kw, 'Top 15 locations')


Next, let's see which of the top locations are most common among disaster tweets.

In [None]:
# Examine the top locations among disaster tweets
display_features(labled,10,'location',min_num_examples= 10,ascending=False,plot_title='Locations with highest % of disaster tweets')

And conversely, let's examine which of the top locations are most common among non-disaster tweets.

In [None]:
# Examine the top locations among non-disaster tweets
display_features(labled,10,'location',min_num_examples= 10, ascending=True,plot_title='Locations with highest % of non-disaster tweets')

Finally, let's see which locations are most commonly used in disaster and non-disaster tweets, respectively.

In [None]:
# Examine the top disaster locations
kw = labled[labled.target == 1].location
plot_top_values(kw, 'Top 15 Disaster locations')

# Examine the top non-disaster locations
kw = labled[labled.target == 0].location
plot_top_values(kw, 'Top 15 Non-Disaster locations')


# Text Feature Extraction

Clean and preprocess the data. We start by finding and removing duplicated rows in the training set.

In [None]:
# Check for duplicates in the training set
print("Number of duplicates:", labled.duplicated().sum())

# Remove duplicates, if any
labled = labled.drop_duplicates().reset_index(drop=True)


We create some functions to help us clean up the text while recording any potentially useful information in a new column to make sure we don't throw any valuable information away.

In [None]:
# Define stop words
STOP_WORDS = set(stop_words)

# Remove hyperlinks, newlines, and extra white space from text
def clean_text(text):
    text = re.sub(r'https?://\S+', '', text)  # drop links first so no double spaces are left behind
    text = re.sub(r'\s+', ' ', text).strip()  # collapse newlines and repeated spaces
    return text

# Test the function
test_str = "A man went to https://www.google.com/ and \n looked up a dog wearing a hat!"
print("Original text:", test_str)
print("Cleaned text:", clean_text(test_str))

In [None]:
# Strip the '#' and '@' symbols, remove stop words, and lowercase the text
def process_text(text):
    text = re.sub(r'#','', text)
    text = re.sub(r'@','', text)
    words = word_tokenize(text)
    text = " ".join([word for word in words if word.lower() not in STOP_WORDS])
    text = text.lower()

    return text

# Test the function
test_str = "I love #dogs and @cats!"
print("Original text: " + test_str)
print("Cleaned text: " + process_text(test_str))

In [None]:
# Extract tokens matching a regular expression from text
def find_reg_ex(tweet, reg_ex):
    tokens = re.finditer(reg_ex, tweet)
    token_list = []
    for match in tokens:
        token = match.group(0)
        token_list.append(token)
    if len(token_list) == 0:
        return 'no'
    else:
        return " ".join(token_list)

# Find hashtags in text
def find_hashtags(tweet):
    hashtags = re.finditer(r"#\w+", tweet)
    hashtags = [match.group(0)[1:] for match in hashtags]
    return " ".join(hashtags) or 'no'

# Extract various features from text
def extract_text_features(df):
    df['text_clean'] = df['text'].apply(lambda x: clean_text(x))
    df['hashtags'] = df['text'].apply(lambda x: find_hashtags(x)) # hashtags
    df['mentions'] = df['text'].apply(lambda x: find_reg_ex(x,r"@\w+")) # mentions
    df['links'] = df['text'].apply(lambda x: find_reg_ex(x,r"https?://\S+")) # links    
    df['text_len'] = df['text_clean'].apply(len) # Tweet length    
    df['word_count'] = df["text_clean"].apply(lambda x: len(str(x).split())) # Word count   
    df['stop_word_count'] = df['text_clean'].apply(lambda x: len([w for w in str(x).lower().split() if w in STOPWORDS])) # Stopword count    
    df['punctuation_count'] = df['text_clean'].apply(lambda x: len([c for c in str(x) if c in string.punctuation])) # Punctuation count    
    # Note: rows without a match hold the placeholder 'no', which counts as one token here
    df['hashtag_count'] = df['hashtags'].apply(lambda x: len(str(x).split())) # Count of hashtags (#)    
    df['mention_count'] = df['mentions'].apply(lambda x: len(str(x).split())) # Count of mentions (@)    
    df['link_count'] = df['links'].apply(lambda x: len(str(x).split())) # Count of links  
    df['caps_count'] = df['text_clean'].apply(lambda x: sum(1 for c in str(x) if c.isupper())) # Count of uppercase letters
    df['caps_ratio'] = df['caps_count'] / df['text_len'] # Ratio of uppercase letters
    df['text_clean'] = df['text_clean'].apply(lambda x: process_text(x))
    return df
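
Because the helpers above return the placeholder `'no'` when nothing matches, counts derived by splitting that string are 1 both for tweets with no hashtag and for tweets with exactly one. The sketch below shows one possible alternative that counts regex matches directly; `count_entities` and the example tweet are hypothetical and not part of the notebook's pipeline.

```python
import re

HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
LINK_RE = re.compile(r"https?://\S+")

def count_entities(tweet):
    """Count hashtags, mentions and links, returning 0 when none are present."""
    return {
        "hashtag_count": len(HASHTAG_RE.findall(tweet)),
        "mention_count": len(MENTION_RE.findall(tweet)),
        "link_count": len(LINK_RE.findall(tweet)),
    }

example = "Forest #fire near La Ronge @CBCNews http://t.co/abc #Canada"
with_entities = count_entities(example)
without_entities = count_entities("just a quiet afternoon")
print(with_entities)
print(without_entities)
```

Switching the notebook to counts like these would change the feature values, so the reported scores may shift slightly.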
 
    



After some processing, we will extract features from the text data and add them as new columns to the dataframe. We'll use the function `extract_text_features` to process the `labled` and `unlabled` dataframes:



In [None]:
labled = extract_text_features(labled) 
unlabled = extract_text_features(unlabled) 

print("The shapes of the labled and unlabled dataframes are:", labled.shape, unlabled.shape)

Next, we measure the linear correlation between each column and the target value. The results don't show a strong correlation for any column. The `id` column ranks second, a ranking we expect to be meaningless. The `stop_word_count` might have some significance, but it's unclear. It's possible that a tweet with many filler words is less likely to be urgent, but this is just speculation.



In [None]:
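# Restrict to numeric columns; recent pandas versions raise an error on text columns otherwise
corr_df = labled.select_dtypes(include='number').corr()['target'].drop('target').sort_values()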
print("Linear correlation between columns and target: \n", corr_df)

Of the engineered columns, only `text_len` and `stop_word_count` appear to have a notable correlation with the target. These columns might be included in the final set of features and the others ignored.
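
For reference, the quantity reported by `corr()` is the Pearson correlation coefficient. A minimal sketch of the formula, applied to hypothetical tweet lengths and 0/1 targets, looks like this:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: tweet lengths and their 0/1 targets.
text_len = [40, 95, 120, 60, 130, 75]
target = [0, 1, 1, 0, 1, 0]
r = pearson(text_len, target)
print(round(r, 3))
```

With a binary target this only captures linear association, so a feature with a low coefficient can still be useful to a non-linear model.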


In [None]:
def top_words_frequency_bar_plot(text,ax = None,plot_title='',num_top_words = 20):
  tokens = word_tokenize(text)

  # Remove stopwords and non-alphabetic tokens from the tokenized text
  filtered_tokens = [w for w in tokens if w not in STOP_WORDS and w.isalpha()]

  # Count the frequency of each word
  word_freq = FreqDist(filtered_tokens)

  # Convert the word frequency to a dataframe
  df_word_freq = pd.DataFrame.from_dict(word_freq, orient='index', columns=['count'])

  # Select the top num_top_words words based on frequency count
  top_words = df_word_freq.sort_values('count',ascending=False).head(num_top_words)

  if ax is None:
    # Create a figure and axes object
    fig, ax = plt.subplots(figsize=(8,6))

  # Horizontal bar plot; x and y are passed by keyword, as recent seaborn versions require
  sns.barplot(x=top_words['count'], y=top_words.index, ax=ax)
  ax.set_title(plot_title)

In [None]:
text = ' '.join(labled['text_clean']).lower()
top_words_frequency_bar_plot(text)

Plot the Most Frequent Words in Disaster and Non-Disaster Tweets

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(15,10))

text = ' '.join(labled.loc[labled.target==1, 'text_clean']).lower()
top_words_frequency_bar_plot(text,ax = ax1,plot_title='disaster')
 

text = ' '.join(labled.loc[labled.target==0, 'text_clean']).lower()
top_words_frequency_bar_plot(text,ax = ax2,plot_title='non-disaster')
 

In [None]:
# Bigrams
from nltk import bigrams


def top_bigrams_frequency_bar_plot(text,ax = None,plot_title='',num_top_bigrams = 20):

  tokens = word_tokenize(text)

  # Remove stopwords and non-alphabetic tokens from the tokenized text
  filtered_tokens = [w for w in tokens if w not in STOP_WORDS and w.isalpha()]

  # Find bigrams from filtered tokens
  bigrams_d = list(bigrams(filtered_tokens))

  # Count the frequency of each bigram
  bigram_freq_d = FreqDist(bigrams_d)

  # Convert the bigram frequency to a dataframe
  df_bigram_freq_d = pd.DataFrame.from_dict(bigram_freq_d, orient='index', columns=['count'])

  # Convert the index from a tuple to a string
  df_bigram_freq_d.index = df_bigram_freq_d.index.map(lambda x: ' '.join(x))

  # Sort the bigrams by count and keep the top num_top_bigrams
  top_bigrams = df_bigram_freq_d.sort_values('count',ascending=False).head(num_top_bigrams)

  if ax is None:
    # Create a figure and axes object
    fig, ax = plt.subplots(figsize=(8,6))

  # Horizontal bar plot; x and y are passed by keyword, as recent seaborn versions require
  sns.barplot(x=top_bigrams['count'], y=top_bigrams.index, ax=ax)
  ax.set_title(plot_title)

This code creates a 2-subplot figure with bar plots of the top 15 bigrams and their frequencies for the disaster and non-disaster text data.

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(22,10))

text = ' '.join(labled.loc[labled.target==1, 'text_clean']).lower()
top_bigrams_frequency_bar_plot(text,plot_title='disaster',ax = ax1, num_top_bigrams = 15)

text = ' '.join(labled.loc[labled.target==0, 'text_clean']).lower()
top_bigrams_frequency_bar_plot(text,plot_title='non-disaster',ax = ax2, num_top_bigrams = 15)
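
The counting step at the heart of these plots can be sketched without NLTK. The helper name `top_bigrams` and the token list are hypothetical; the sketch only shows how adjacent pairs are formed and counted.

```python
from collections import Counter

def top_bigrams(tokens, n=3):
    """Return the n most common adjacent token pairs."""
    pairs = zip(tokens, tokens[1:])
    return Counter(pairs).most_common(n)

# Hypothetical, already-cleaned tokens.
tokens = "forest fire spreads forest fire warning issued".split()
most_common = top_bigrams(tokens, n=1)
print(most_common)
```

One caveat about the cells above: they join all tweets into a single string before tokenizing, so some counted pairs straddle the boundary between two different tweets.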

The following code cell vectorizes the `links`, `mentions`, and `hashtags` columns of the training and test sets into separate dataframes with one column per token, counting how many times each token appears in each tweet. Scikit-learn's `CountVectorizer` does the vectorization, and `min_df = 5` keeps only tokens that appear in at least five training tweets.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer


# Links
vec_links = CountVectorizer(min_df = 5, analyzer = 'word', token_pattern = r'https?://\S+') # Counts how often each link appears
link_vec = vec_links.fit_transform(labled['links']) # Fit on and vectorize the labled data
link_vec_test = vec_links.transform(unlabled['links']) # Vectorize the unlabled data with the same vocabulary
X_train_link = pd.DataFrame(link_vec.toarray(), columns=vec_links.get_feature_names_out()) # Labled counts as a dataframe
X_test_link = pd.DataFrame(link_vec_test.toarray(), columns=vec_links.get_feature_names_out()) # Unlabled counts as a dataframe

# Mentions
vec_men = CountVectorizer(min_df = 5)
men_vec = vec_men.fit_transform(labled['mentions'])
men_vec_test = vec_men.transform(unlabled['mentions'])
X_train_men = pd.DataFrame(men_vec.toarray(), columns=vec_men.get_feature_names_out())
X_test_men = pd.DataFrame(men_vec_test.toarray(), columns=vec_men.get_feature_names_out())

# Hashtags
vec_hash = CountVectorizer(min_df = 5)
hash_vec = vec_hash.fit_transform(labled['hashtags'])
hash_vec_test = vec_hash.transform(unlabled['hashtags'])
X_train_hash = pd.DataFrame(hash_vec.toarray(), columns=vec_hash.get_feature_names_out())
X_test_hash = pd.DataFrame(hash_vec_test.toarray(), columns=vec_hash.get_feature_names_out())
print(X_train_link.shape, X_train_men.shape, X_train_hash.shape)
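
The effect of `min_df` is easy to see in a small hand-rolled version. This sketch is not scikit-learn's implementation; the helper names and the toy documents are hypothetical, and tokenization is reduced to splitting on whitespace.

```python
from collections import Counter

def vocabulary_with_min_df(documents, min_df):
    """Keep tokens that occur in at least min_df documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.split()))  # count each token once per document
    return sorted(tok for tok, df in doc_freq.items() if df >= min_df)

def count_vectors(documents, vocabulary):
    """Return one row of token counts per document, ordered like vocabulary."""
    rows = []
    for doc in documents:
        counts = Counter(doc.split())
        rows.append([counts[tok] for tok in vocabulary])
    return rows

# Hypothetical hashtag strings, one per tweet.
docs = ["news fire", "fire fire", "news", "no", "no", "rare"]
vocab = vocabulary_with_min_df(docs, min_df=2)
matrix = count_vectors(docs, vocab)
print(vocab)
print(matrix)
```

The token `rare` appears in only one document and is dropped, so the last row is all zeros. The same happens in the notebook to any tweet whose hashtags, mentions, or links are all infrequent.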

The following code creates a bar plot of the top 15 most frequent link features in the data

In [None]:
X = X_train_link.sum()
num_to_display = 15

ascending = False
X = X.sort_values( ascending=ascending).head(num_to_display)

# plot the top column_name using seaborn
fig, ax = plt.subplots(figsize=(15,10))

sns.barplot(x= X , y =X.index,ax = ax)
 
plt.xticks(rotation=90)

# display the plot
plt.show()


The following code creates a bar plot of the top 15 most frequent mention features in the data

In [None]:
X = X_train_men.sum()
X = X.drop('no') #no is overwhelmingly common so we drop it for sake of analysis
num_to_display = 15


X = X.sort_values( ascending=ascending).head(num_to_display)

# plot the top column_name using seaborn
fig, ax = plt.subplots(figsize=(15,10))

sns.barplot(x= X , y =X.index,ax = ax)
 
plt.xticks(rotation=90)

# display the plot
plt.show()


The following code creates a bar plot of the top 15 most frequent hashtag features in the data

In [None]:
X = X_train_hash.sum()
X = X.drop('no') #no is overwhelmingly common so we drop it for sake of analysis
num_to_display = 15


X = X.sort_values( ascending=ascending).head(num_to_display)

# plot the top column_name using seaborn
fig, ax = plt.subplots(figsize=(15,10))

sns.barplot(x= X , y =X.index,ax = ax)
 
plt.xticks(rotation=90)

# display the plot
plt.show()


Next we examine the fraction of disaster tweets for each feature. In particular, a bar plot is generated to visualize the ratio of disaster tweets to all tweets for each feature (such as links, mentions, and hashtags) based on the training data.

In [None]:
X = X_train_link
num_to_display = 50

# Compute the ratio of disaster tweets to all tweets for each link
disaster_ratio = (X.transpose().dot(labled['target']) / X.sum(axis=0)).sort_values(ascending=False)

# Create a bar plot to visualize the results
plt.figure(figsize=(20,15))
disaster_ratio = disaster_ratio[:num_to_display]

sns.barplot(x=disaster_ratio, y=disaster_ratio.index)

# Add a line to indicate the overall fraction of disaster tweets
plt.axvline(np.mean(labled.target))

plt.title('% of disaster tweet')

# Show the plot
plt.show()
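
The expression `X.transpose().dot(target) / X.sum(axis=0)` can look opaque, so here is a plain-Python sketch of what it computes for each column. The count matrix and targets are hypothetical.

```python
def disaster_ratio(count_rows, targets):
    """For each column, return sum(count * target) / sum(count)."""
    n_cols = len(count_rows[0])
    ratios = []
    for j in range(n_cols):
        weighted = sum(row[j] * t for row, t in zip(count_rows, targets))
        total = sum(row[j] for row in count_rows)
        ratios.append(weighted / total)
    return ratios

# Hypothetical counts for two features across four tweets.
counts = [
    [1, 0],
    [2, 0],
    [0, 1],
    [1, 1],
]
targets = [1, 1, 0, 0]
ratios = disaster_ratio(counts, targets)
print(ratios)
```

Note that the ratio is weighted by occurrence counts: a tweet that contains a token twice contributes twice, so this is the fraction of token occurrences, rather than the fraction of tweets, that come from disaster tweets.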


In [None]:
X = X_train_men
num_to_display = 50

# Compute the ratio of disaster tweets to all tweets for each mention
disaster_ratio = (X.transpose().dot(labled['target']) / X.sum(axis=0)).sort_values(ascending=False)

# Create a bar plot to visualize the results
plt.figure(figsize=(20,15))
disaster_ratio = disaster_ratio[:num_to_display]

sns.barplot(x=disaster_ratio, y=disaster_ratio.index)

# Add a line to indicate the overall fraction of disaster tweets
plt.axvline(np.mean(labled.target))

plt.title('% of disaster tweet')

# Show the plot
plt.show()


In [None]:
X = X_train_hash
num_to_display = 100

# Compute the ratio of disaster tweets to all tweets for each hashtag
disaster_ratio = (X.transpose().dot(labled['target']) / X.sum(axis=0)).sort_values(ascending=False)

# Create a bar plot to visualize the results
plt.figure(figsize=(20,15))
disaster_ratio = disaster_ratio[:num_to_display]

sns.barplot(x=disaster_ratio, y=disaster_ratio.index)

# Add a line to indicate the overall fraction of disaster tweets
plt.axvline(np.mean(labled.target))

plt.title('% of disaster tweet')

# Show the plot
plt.show()


The next code cell performs TF-IDF transformation on the training and test sets text data using a TfidfVectorizer with both single words and bi-grams.

In [None]:
# Tf-idf for text
from sklearn.feature_extraction.text import TfidfVectorizer

# initialize TfidfVectorizer with minimum document frequency of 10, n-gram range of (1,2) and stop words in english
vec_text = TfidfVectorizer(min_df = 10, ngram_range = (1,2), stop_words='english') 

# fit the vectorizer to the labeled text data and transform it
text_vec = vec_text.fit_transform(labled['text_clean'])
# transform the unlabeled text data using the fitted vectorizer
text_vec_test = vec_text.transform(unlabled['text_clean'])

# create a dataframe for the transformed labeled text data using the feature names from the vectorizer
X_train_text = pd.DataFrame(text_vec.toarray(), columns=vec_text.get_feature_names_out())

# create a dataframe for the transformed unlabeled text data using the feature names from the vectorizer
X_test_text = pd.DataFrame(text_vec_test.toarray(), columns=vec_text.get_feature_names_out())

# print the shape of the training text dataframe
print (X_train_text.shape)
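
As a rough guide to what the vectorizer produces, the sketch below computes TF-IDF by hand using a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. To our understanding this matches the formula documented for `TfidfVectorizer`'s default settings, but the sketch ignores tokenization details, n-grams, and `min_df`, and the documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(documents):
    """TF-IDF with smoothed idf and L2-normalized rows (unigrams only)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    vocab = sorted({tok for doc in tokenized for tok in doc})
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # guard empty documents
        rows.append([w / norm for w in weights])
    return vocab, rows

docs = ["fire forest fire", "forest walk", "traffic jam"]
vocab, rows = tfidf(docs)
row_norms = [math.sqrt(sum(w * w for w in row)) for row in rows]
print(vocab)
```

In the second document, `walk` receives a larger weight than `forest` because it appears in fewer documents, which is the behavior that makes TF-IDF favor distinctive terms.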


In [None]:
# Joining the dataframes together
labled_ext = labled.copy()
unlabled_ext = unlabled.copy()

labled_ext = labled_ext.join(X_train_link, rsuffix='_link')
labled_ext = labled_ext.join(X_train_men, rsuffix='_mention')
labled_ext = labled_ext.join(X_train_hash, rsuffix='_hashtag')
labled_ext = labled_ext.join(X_train_text, rsuffix='_text')
unlabled_ext = unlabled_ext.join(X_test_link, rsuffix='_link')
unlabled_ext = unlabled_ext.join(X_test_men, rsuffix='_mention')
unlabled_ext = unlabled_ext.join(X_test_hash, rsuffix='_hashtag')
unlabled_ext = unlabled_ext.join(X_test_text, rsuffix='_text')
print (labled_ext.shape, unlabled_ext.shape)

In [None]:


# We drop the columns that we don't need for our model.
features_to_drop = ['id', 'keyword','location','text','text_clean', 'hashtags', 'mentions','links']

# We drop the target column from the labled data set (since it is the column that we want to predict).
X_labled = labled_ext.drop(columns = features_to_drop + ['target'])
X_unlabled= unlabled_ext.drop(columns = features_to_drop)

Y_labled = labled_ext.target


In [None]:
x_train, x_valid, y_train, y_valid = train_test_split(X_labled.values, Y_labled, 
                                                  stratify=labled.target.values, 
                                                  random_state=23, 
                                                  test_size=0.1, shuffle=True)
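
The `stratify` argument keeps the class proportions the same in both parts of the split. A simplified sketch of the idea follows; `stratified_split` is a hypothetical helper, and scikit-learn's exact allocation of samples may differ from this one.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.1, seed=23):
    """Return (train_idx, valid_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_valid = round(len(indices) * test_size)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)

# Hypothetical labels: 60 non-disaster and 40 disaster tweets.
labels = [0] * 60 + [1] * 40
train_idx, valid_idx = stratified_split(labels, test_size=0.1)
valid_share = sum(labels[i] for i in valid_idx) / len(valid_idx)
print(len(train_idx), len(valid_idx), valid_share)
```

Splitting each class separately guarantees that the validation set has the same 40% share of disaster tweets as the full toy data set.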

In [None]:
x_train.shape

In [None]:
print (x_train.shape)
print (x_valid.shape)

In [None]:


# F-1 score
from sklearn.metrics import f1_score, accuracy_score
# Confusion matrix
from sklearn.metrics import confusion_matrix


In [None]:
def get_performance_metrics(pipeline,x_train, x_valid, y_train, y_valid):
  """Print accuracy and F1 on the training and validation sets, plus the training confusion matrix."""
  print('-'*20)
  print ('Training accuracy: %.4f' % pipeline.score(x_train, y_train))
  print ('Validation accuracy: %.4f' % pipeline.score(x_valid, y_valid))
  print ('Training f-1 score: %.4f' % f1_score(y_train, pipeline.predict(x_train)))
  print ('Validation f-1 score: %.4f' % f1_score(y_valid, pipeline.predict(x_valid)))
  # Rows are true classes, columns are predicted classes (training set)
  print(pd.DataFrame(confusion_matrix(y_train, pipeline.predict(x_train))))
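
For readers who want to see how these metrics relate to the confusion matrix, here is a small sketch that derives accuracy, precision, recall, and F1 from the four confusion counts. The helper `binary_metrics` and the label lists are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = disaster)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

Precision answers "of the tweets flagged as disasters, how many were real", while recall answers "of the real disasters, how many were flagged"; F1 is their harmonic mean.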

# Model Selection and Evaluation

We train several models on the prepared data, such as logistic regression, AdaBoost, and gradient boosting. Cross-validation is used to evaluate each model and to select the best one.

We perform stratified k-fold cross-validation with scikit-learn's `StratifiedKFold`, passing in the feature matrix and target variable prepared above. The helper `run_kfold_training` fits the pipeline on each fold and reports the validation accuracy.

In [None]:


# Target encoding
"""
# Declare the columns to be encoded
features = ['keyword','location']

# Initialize the encoder
encoder = ce.TargetEncoder(cols=features)
"""

"""
# Fit the encoder to the labeled dataset
encoder.fit(labled[features],labled['target'])

# Apply the encoding to the labeled dataset
labled = labled.join(encoder.transform(labled[features]).add_suffix('_target'))

# Apply the encoding to the unlabeled dataset
unlabled = unlabled.join(encoder.transform(unlabled[features]).add_suffix('_target'))
"""

def run_kfold_training(pipeline,X_labled, Y_labled):
  """Fit and score the pipeline on 5 stratified folds and print the validation accuracies."""
  skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=149)
  scores = []

  # The experimental per-fold target encoding of 'keyword' and 'location' is disabled.
  for train_index, val_index in skf.split(X_labled.values, Y_labled):
    x_train, x_valid = X_labled.values[train_index], X_labled.values[val_index]
    # The split yields positions, so index the target positionally with .iloc
    y_train, y_valid = Y_labled.iloc[train_index], Y_labled.iloc[val_index]

    pipeline.fit(x_train, y_train)
    y_pred = pipeline.predict(x_valid)
    score = accuracy_score(y_valid, y_pred)
    get_performance_metrics(pipeline,x_train, x_valid, y_train, y_valid)
    scores.append(score)

  print("Cross validation scores: ", scores)
  print("Mean cross validation score: ", np.mean(scores))
  print("Std of cross validation scores: ", np.std(scores))
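
To illustrate what a stratified k-fold split guarantees, the sketch below builds fold indices by dealing each class's samples round-robin across the folds. It is a simplification under stated assumptions: `stratified_kfold_indices` is a hypothetical helper, and it does not shuffle, whereas the notebook uses `shuffle=True`.

```python
from collections import defaultdict

def stratified_kfold_indices(labels, n_splits=5):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    folds = [[] for _ in range(n_splits)]
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Deal each class's indices round-robin so every fold gets a similar class mix.
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    for k in range(n_splits):
        valid_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, valid_idx

# Hypothetical labels: 12 non-disaster and 8 disaster tweets.
labels = [0] * 12 + [1] * 8
splits = list(stratified_kfold_indices(labels, n_splits=4))
for train_idx, valid_idx in splits:
    print(len(train_idx), len(valid_idx), sum(labels[i] for i in valid_idx))
```

Every fold holds the same number of disaster tweets, and across the folds each sample is used for validation exactly once, which is why the mean of the fold scores is a reasonable performance estimate.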

In [None]:
scaler = MinMaxScaler()
lr = LogisticRegression(C = 0.6,solver='liblinear', penalty = 'l2', random_state=129) # Other solvers fail to converge
pipeline = Pipeline([('scale',scaler), ('lr', lr),])

run_kfold_training(pipeline,X_labled, Y_labled)
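
Placing the scaler inside the pipeline means its minimum and maximum are learned from the training fold only and then reused on the validation fold. A minimal sketch of that fit/apply separation, with hypothetical helper names and data:

```python
def fit_min_max(rows):
    """Return per-column (min, max) learned from the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]

def apply_min_max(rows, params):
    """Scale each column to (x - min) / (max - min); constant columns map to 0."""
    scaled = []
    for row in rows:
        scaled.append([
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, (lo, hi) in zip(row, params)
        ])
    return scaled

train_rows = [[10, 0.0], [20, 0.5], [30, 1.0]]
valid_rows = [[40, 0.25]]

params = fit_min_max(train_rows)          # learned from training data only
train_scaled = apply_min_max(train_rows, params)
valid_scaled = apply_min_max(valid_rows, params)
print(train_scaled)
print(valid_scaled)
```

Validation values can fall outside the range 0 to 1 (here 1.5) because the validation rows did not influence the learned minimum and maximum; that is expected and avoids leaking validation information into training.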

In [None]:


low = 0.25
high = 0.5
steps = 10
 
param_grid = {
    'lr__C': [low + k*(high-low)/steps for k in range(steps)],
    #'lr__penalty': ['l1', 'l2', 'none']
    #'lr__solver': ['newton-cg']
}

# 'newton-cg' , 'lbfgs', 'sag'

# Use GridSearchCV to find the best regularization strength
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(x_train, y_train)

# Get the best regularization strength
best_C = grid_search.best_params_['lr__C']
#best_solver = grid_search.best_params_['lr__solver']
#best_penalty = grid_search.best_params_['lr__penalty']

# Retrain the model using the best regularization strength
lr = LogisticRegression(C=best_C, penalty = 'l2', solver='liblinear', random_state=129) # same solver as used during the grid search
pipeline = Pipeline([('scale',scaler), ('lr', lr),])
pipeline.fit(x_train, y_train)
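
It can help to print the candidate values the grid actually contains. The sketch below reuses the grid expression from the cell above and adds a hypothetical variant, `grid_inclusive`, that also includes the upper endpoint.

```python
low, high, steps = 0.25, 0.5, 10

# Same expression as in the notebook: `high` itself is never included.
grid = [low + k * (high - low) / steps for k in range(steps)]

# Variant that includes both endpoints (steps + 1 values).
grid_inclusive = [low + k * (high - low) / steps for k in range(steps + 1)]

print([round(c, 3) for c in grid])
print([round(c, 3) for c in grid_inclusive])
```

The searched values run from 0.25 to 0.475, so if the best `C` lands on the last grid point, widening the range is worth a try.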



# Experimenting with Other Models

In [None]:
# Create an instance of a base classifier
dt = DecisionTreeClassifier(max_depth=1)

# Create an instance of AdaBoostClassifier
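# The base estimator is passed positionally because the keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2
adb = AdaBoostClassifier(dt, n_estimators=100)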


pipeline = Pipeline([('scale',scaler), ('adb', adb)])
 

run_kfold_training(pipeline,X_labled, Y_labled)
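
To give some intuition for boosting with depth-1 trees, the sketch below performs one reweighting step of classic (discrete) AdaBoost with labels in {-1, +1}. This is the textbook update, not scikit-learn's exact implementation, which differs in detail; the helper name and the toy predictions are hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One reweighting step of classic AdaBoost; requires 0 < weighted error < 1."""
    error = sum(w for w, y, p in zip(weights, y_true, y_pred) if y != p)
    alpha = 0.5 * math.log((1 - error) / error)  # vote weight of this weak learner
    updated = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, -1, -1, -1, +1]  # a hypothetical stump that misclassifies one sample
weights = [1 / len(y_true)] * len(y_true)

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(round(alpha, 4), [round(w, 3) for w in new_weights])
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.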

AdaBoost with a logistic regression base estimator

In [None]:
scaler = MinMaxScaler()
lr = LogisticRegression(C=0.5, random_state=129)  # default solver; alternative tried: solver='liblinear', penalty='l1'

# Create an instance of AdaBoostClassifier
adb = AdaBoostClassifier(base_estimator=lr, n_estimators=100)

pipeline = Pipeline([('scale',scaler), ('adb', adb)])
 

run_kfold_training(pipeline, X_labled, Y_labled)

Gradient Boosting Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
gbc = GradientBoostingClassifier(n_estimators=100, random_state=42)

pipeline = Pipeline([('scale',scaler), ('gbc', gbc)])
 

#pipeline.fit(x_train,y_train)
 
run_kfold_training(pipeline, X_labled, Y_labled)

# Evaluate the Performance of the Model
Use the scores obtained from cross-validation to evaluate the model. The mean of the scores estimates its performance, and the standard deviation shows how much that estimate varies from fold to fold.

Hyperparameter tuning: If the performance of the model is not satisfactory, fine-tune it by searching for the best hyperparameters with techniques such as GridSearchCV or RandomizedSearchCV.
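
The mean and standard deviation mentioned above can be obtained directly from the array that `cross_val_score` returns. The sketch below uses synthetic data and hypothetical names (`X_demo`, `y_demo`, `demo_pipeline`); the fold count and random seeds are illustrative assumptions rather than the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the engineered tweet features (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

demo_pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("lr", LogisticRegression(C=0.6, solver="liblinear", random_state=129)),
])

# Stratified folds keep the class balance of each fold close to that of the full data set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=23)
cv_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")

mean_score = cv_scores.mean()
std_score = cv_scores.std(ddof=1)  # sample standard deviation across the folds
print(f"Accuracy: {mean_score:.3f} +/- {std_score:.3f} over {len(cv_scores)} folds")
```

A large standard deviation relative to the gap between two models suggests that the difference between them may not be meaningful.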

In [None]:
get_performance_metrics(pipeline,x_train, x_valid, y_train, y_valid)

In [None]:
pipeline

**Train the model on the entire data set**

In [None]:
pipeline.fit(X_labled.values, Y_labled)

In [None]:
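# Note: the pipeline has now been fit on all labeled data, including x_valid, so these validation metrics are optimistic.
get_performance_metrics(pipeline,x_train, x_valid, y_train, y_valid)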

In [None]:
unlabled_hand_labled = unlabled.copy()

In [None]:
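# Spot-check a random unlabeled tweet against its predicted label.
# Note: `submit` is created in the "Create Submission" cell below, so run that cell first.
index = np.random.randint(0,len(unlabled_hand_labled))
print(unlabled_hand_labled.iloc[index].text)
print(submit.target[index])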

**Create Submission**

In [None]:
y_test = pipeline.predict(X_unlabled.values)

submit = sample.copy()
submit.target = y_test
submit.to_csv('submit_maybe_better.csv',index=False)

***Run the following if you want to submit now; otherwise, continue to the next section to try to refine the model.***

In [None]:
! kaggle competitions submit -c nlp-getting-started -f submit_maybe_better.csv -m "Message"

# Deep Learning Approach

To keep things nice and neat, we copy some of the interesting features into a new dataframe.

In this context, `dl` stands for deep learning. We recall the correlations between the various features and the target.

Only `text_len` and `stop_word_count` appear to be correlated with the target. We will keep all of the features for now.


In [None]:
labled.columns

In [None]:
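# Copy the selected columns so that the assignments below do not write into a view of the original frames.
labled_dl = labled[['keyword','location','text_clean','stop_word_count','text_len','hashtags','target']].copy()
unlabled_dl = unlabled[['keyword','location','text_clean','stop_word_count','text_len','hashtags']].copy()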

In [None]:
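# Restrict to numeric columns first; recent pandas versions raise an error when corr() meets text columns.
labled_dl.select_dtypes(include='number').corr()['target'].drop('target').sort_values()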

In [None]:
labled_dl.keyword = labled_dl.keyword.fillna('no')
unlabled_dl.keyword = unlabled_dl.keyword.fillna('no')

labled_dl.location = labled_dl.location.fillna('no')
unlabled_dl.location = unlabled_dl.location.fillna('no')

In [None]:
labled_dl.head(10)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


# We drop the columns that we don't need for our model.
features_to_drop = []

# We drop the target column from the labled data set (since it is the column that we want to predict).
X_labled = labled_dl.drop(columns = features_to_drop + ['target'])
X_unlabled= unlabled_dl.drop(columns = features_to_drop)

Y_labled = labled_ext.target

In [None]:
X_labled.head(10)

Fit a Keras tokenizer on the text columns of both the labeled and unlabeled data sets (the train-validation split is created further below)

In [None]:
from keras.preprocessing import text
# using keras tokenizer here
token = text.Tokenizer(num_words=None)


# Collect every text column from both data sets so that the vocabulary covers all of them.
relavent_text_data = []

features = ['keyword', 'location', 'text_clean', 'hashtags']
for f in features:
  relavent_text_data += list(X_labled[f])
  relavent_text_data += list(X_unlabled[f])

token.fit_on_texts(relavent_text_data)
word_index = token.word_index


In [None]:
X_labled.text_clean

In [None]:
# Distribution of the cleaned tweet lengths, measured in characters
sns.histplot(list(labled_dl.text_clean.apply(lambda x : len(x))));

Add the tokenized sentences

In [None]:
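def tokenizer(df,feature,padding_len):
  """Add `padding_len` integer columns named `<feature>_<i>` to `df` (modifies `df` in place)."""
  # Map each text to a list of word indices using the fitted Keras tokenizer
  X = token.texts_to_sequences(df[feature])

  # Zero pad (or truncate) the sequences to a fixed length
  X_pad = pad_sequences(X, maxlen=padding_len)

  # Store each token position as its own column
  for i in range(padding_len):
      df[feature +'_' + str(i)] = [x[i] for x in X_pad]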
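
To make the padding step concrete: with its default arguments (`padding='pre'`, `truncating='pre'`), Keras `pad_sequences` pads short sequences with zeros on the left and keeps only the last `maxlen` tokens of long ones. The sketch below re-implements that default behaviour in plain NumPy so it can be inspected without TensorFlow; `pad_pre` and `demo_sequences` are hypothetical names used only for illustration.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Mimic the default behaviour of Keras `pad_sequences` (illustrative helper)."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # an empty text stays a row of padding values
        kept = seq[-maxlen:]             # truncating='pre': drop the earliest tokens
        padded[row, -len(kept):] = kept  # padding='pre': zeros fill the left side
    return padded

# Made-up token index sequences: one short, one too long, one empty.
demo_sequences = [[5, 8], [3, 1, 4, 1, 5, 9], []]
demo_padded = pad_pre(demo_sequences, maxlen=4)
print(demo_padded)
```

Here the short sequence becomes `[0, 0, 5, 8]` and the long one keeps its last four tokens, `[4, 1, 5, 9]`. Index 0 is free to act as padding because the Keras tokenizer numbers words starting at 1.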


In [None]:
max_len = 120
tokenizer(X_labled,'text_clean',max_len)
tokenizer(X_unlabled,'text_clean',max_len)

tokenizer(X_labled,'keyword',1)
tokenizer(X_unlabled,'keyword',1)


tokenizer(X_labled,'location',1)
tokenizer(X_unlabled,'location',1)

tokenizer(X_labled,'hashtags',1)
tokenizer(X_unlabled,'hashtags',1)

Drop the raw text columns, keeping only their tokenized versions

In [None]:
# The raw text columns have been replaced by their tokenized versions, so we drop them.
features_to_drop = ['keyword', 'location', 'text_clean', 'hashtags']

X_labled = X_labled.drop(columns = features_to_drop)
X_unlabled= X_unlabled.drop(columns = features_to_drop)
 

In [None]:
X_labled.head(10)

In [None]:
x_train, x_valid, y_train, y_valid = train_test_split(X_labled.values, Y_labled, 
                                                  stratify=Y_labled, 
                                                  random_state=23, 
                                                  test_size=0.2, shuffle=True)

In [None]:
print (x_train.shape)
print (x_valid.shape)

# Word Embeddings

To import the vectors, I followed this [link](https://stackoverflow.com/questions/50060241/how-to-use-glove-word-embeddings-file-on-google-colaboratory), which explains everything, including how to save the vectors in your Drive for later use.

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

In [None]:
!unzip glove*.zip

In [None]:
!ls
!pwd

In [None]:
# Create an empty dictionary to store the word vectors
embeddings_index = {}

# Open the file 'glove.6B.300d.txt' with utf-8 encoding; the `with` block closes it automatically
with open('glove.6B.300d.txt', encoding='utf-8') as f:
    # Use tqdm to display a progress bar while iterating through the file
    for line in tqdm(f):
        # Split the line by space and store the values in a list
        values = line.strip().split(' ')

        # The first value in the list is the word
        word = values[0]

        # The rest of the values represent the coefficients of the word vector, which are stored as a numpy array
        coefs = np.asarray(values[1:], dtype='float32')

        # Add the word vector to the dictionary
        embeddings_index[word] = coefs

# Print the number of word vectors found
print('Found %s word vectors.' % len(embeddings_index))

In [None]:
# create an embedding matrix for the words we have in the dataset
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
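
Words that are missing from GloVe keep an all-zero row in the embedding matrix, so it can be useful to check how much of the vocabulary is actually covered. The sketch below shows the idea on a tiny in-memory example; the three 4-dimensional vectors, `demo_word_index`, and the misspelled word are made up for illustration (the real file uses 300 dimensions).

```python
import io
import numpy as np

# Three made-up vectors in the GloVe text format: a word followed by its coefficients.
demo_glove = io.StringIO(
    "fire 0.1 0.2 0.3 0.4\n"
    "flood 0.5 0.6 0.7 0.8\n"
    "the 0.0 0.1 0.0 0.1\n"
)

demo_embeddings = {}
for line in demo_glove:
    values = line.strip().split(" ")
    demo_embeddings[values[0]] = np.asarray(values[1:], dtype="float32")

# Hypothetical tokenizer vocabulary; Keras word indices start at 1, index 0 is padding.
demo_word_index = {"fire": 1, "flood": 2, "wildfiree": 3}

dim = 4
demo_matrix = np.zeros((len(demo_word_index) + 1, dim))
missing = []
for word, i in demo_word_index.items():
    vector = demo_embeddings.get(word)
    if vector is not None:
        demo_matrix[i] = vector
    else:
        missing.append(word)  # this row stays all-zero

coverage = 1 - len(missing) / len(demo_word_index)
print(f"Coverage: {coverage:.0%}, missing words: {missing}")
```

On tweet data, low coverage often points to hashtags, misspellings, or user handles that survived cleaning, which may be worth inspecting before training.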

# Deep Learning Models

In [None]:
from keras import backend as K

def f1_score(y_true, y_pred):
    def recall(y_true, y_pred):
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        recall = true_positives / (possible_positives + K.epsilon())
        return recall

    def precision(y_true, y_pred):
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        precision = true_positives / (predicted_positives + K.epsilon())
        return precision

    precision = precision(y_true, y_pred)
    recall = recall(y_true, y_pred)
    return 2 * ((precision * recall) / (precision + recall + K.epsilon()))
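
The metric above rounds the predicted probabilities to 0 or 1 and then combines precision $P$ and recall $R$ as $F_1 = 2PR/(P+R)$. The NumPy sketch below mirrors that computation on a small made-up example and compares it with scikit-learn's `f1_score`; `f1_from_probabilities`, `demo_true`, and `demo_prob` are hypothetical names. Note that Keras averages such function-based metrics over batches, so the value reported during training may differ slightly from the F1 score of the whole data set.

```python
import numpy as np
from sklearn.metrics import f1_score as sk_f1_score

def f1_from_probabilities(y_true, y_prob, eps=1e-7):
    """F1 from probabilities, following the same steps as the Keras metric above."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.round(np.clip(y_prob, 0, 1))          # hard 0/1 predictions
    true_positives = np.sum(y_true * y_hat)
    precision = true_positives / (np.sum(y_hat) + eps)
    recall = true_positives / (np.sum(y_true) + eps)
    return 2 * precision * recall / (precision + recall + eps)

# Made-up labels and predicted probabilities.
demo_true = np.array([1, 0, 1, 1, 0, 0])
demo_prob = np.array([0.9, 0.2, 0.4, 0.7, 0.6, 0.1])

manual_f1 = f1_from_probabilities(demo_true, demo_prob)
reference_f1 = sk_f1_score(demo_true, (demo_prob > 0.5).astype(int))
print(manual_f1, reference_f1)
```

In this example two of the three predicted positives are correct and two of the three actual positives are found, so both precision and recall are 2/3 and the two computations agree up to the small `eps` term.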


In [None]:
# A simple LSTM with glove embeddings and two dense layers
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=x_train.shape[1],
                     trainable=False))

model.add(tf.keras.layers.SpatialDropout1D(0.3))

model.add(tf.keras.layers.LSTM(100, dropout=0.3, recurrent_dropout=0.3,return_sequences=True))

model.add(tf.keras.layers.LSTM(100, dropout=0.3, recurrent_dropout=0.3))

model.add(tf.keras.layers.Dense(1024, activation='relu'))
model.add(tf.keras.layers.Dropout(0.8))

model.add(tf.keras.layers.Dense(1024, activation='relu'))
model.add(tf.keras.layers.Dropout(0.8))


model.add(tf.keras.layers.Dense(1, activation = 'sigmoid'))

# compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics =['accuracy',f1_score])

earlystop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=2, verbose=0, mode='auto')

In [None]:
# Define the early stopping
earlystop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', 
    min_delta=0, 
    patience=8, 
    verbose=0, 
    mode='auto'
)

model.fit(x_train , y_train, batch_size=128, 
          epochs=100, verbose=1, 
          validation_data=(x_valid, y_valid), callbacks =[earlystop])
 

In [None]:
model.save('model.h5')

In [None]:

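# The model was compiled with the custom `f1_score` metric, so Keras needs it to rebuild the model
model = tf.keras.models.load_model("model.h5", custom_objects={"f1_score": f1_score})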


In [None]:
predictions = model.predict(x_train)
predictions = predictions.flatten()

In [None]:
predictions

Maximize F1

In [None]:
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

# Compute the precision, recall, and threshold values for different thresholds
precision, recall, thresholds = precision_recall_curve(y_train, predictions)

# Compute the F1 score for each threshold (the small constant avoids division by zero)
f1 = 2 * (precision * recall) / (precision + recall + 1e-12)

# Plot the F1 score vs. threshold
plt.plot(thresholds, f1[:-1], color='darkorange', lw=2, label='F1 score')

# Set the x and y limits of the plot
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])

# Label the x and y axis
plt.xlabel('Threshold')
plt.ylabel('F1 score')

# Give the plot a title
plt.title('F1 score vs. threshold')

# Add the legend to the plot
plt.legend(loc="upper right")

# Show the plot
plt.show()

# Find the optimal threshold (the threshold that maximizes the F1 score)
optimal_idx = np.argmax(f1[:-1])
optimal_threshold = thresholds[optimal_idx]


In [None]:
optimal_threshold

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# `predictions` were computed on x_train, so they must be compared against y_train
fpr, tpr, thresholds = roc_curve(y_train, predictions)
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]


In [None]:
optimal_threshold

Max Accuracy

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Compute the ROC curve, which is a graph showing the true positive rate vs. false positive rate
fpr, tpr, thresholds = roc_curve(y_train, predictions)

# Compute the area under the curve (AUC) of the ROC
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)

# Plot the diagonal line (representing random guess)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')

# Set the x and y limits of the plot
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])

# Label the x and y axis
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

# Give the plot a title
plt.title('Receiver operating characteristic example')

# Add the legend to the plot
plt.legend(loc="lower right")

# Show the plot
plt.show()

# Find the optimal threshold (the threshold that maximizes the difference between true positive rate and false positive rate)
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
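
The threshold selected above maximizes TPR minus FPR (Youden's J statistic), which does not necessarily coincide with the threshold that maximizes accuracy. A minimal sketch of searching for the accuracy-maximizing cut-off is shown below; the labels and scores are synthetic stand-ins for validation labels and model outputs, and all names are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_curve

rng = np.random.default_rng(0)

# Synthetic labels and scores standing in for validation labels and model outputs (assumption).
demo_labels = rng.integers(0, 2, size=500)
demo_scores = np.clip(0.35 * demo_labels + rng.normal(0.35, 0.2, size=500), 0, 1)

# Threshold that maximizes TPR - FPR, as in the cell above.
fpr, tpr, roc_thresholds = roc_curve(demo_labels, demo_scores)
youden_threshold = roc_thresholds[np.argmax(tpr - fpr)]

# Scan the observed scores as candidate cut-offs, using the same `>` rule as the prediction cells.
candidates = np.unique(demo_scores)
accuracies = [accuracy_score(demo_labels, (demo_scores > t).astype(int)) for t in candidates]
accuracy_threshold = candidates[int(np.argmax(accuracies))]

print("Youden's J threshold:", youden_threshold)
print("Max-accuracy threshold:", accuracy_threshold)
print("Best accuracy:", max(accuracies))
```

Selecting the threshold on held-out predictions rather than on training predictions usually gives a more realistic estimate, because the model's scores on data it was trained on tend to be overconfident.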

In [None]:
predictions = model.predict(X_unlabled.values)

In [None]:
optimal_threshold

In [None]:
predictions_bool = ( predictions > optimal_threshold ) + 0
predictions_bool = predictions_bool.flatten()

In [None]:
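# Alternative: use the default 0.5 cut-off (running this cell overrides the tuned threshold from the previous cell)
predictions_bool = ( predictions > 0.5 ) + 0
predictions_bool = predictions_bool.flatten()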

In [None]:
predictions_bool[:15]

Create submission

In [None]:
submit = sample.copy()
submit.target = predictions_bool
submit.to_csv('lstm_120_token_length.csv',index=False)

***Run the following if you want to submit now; otherwise, continue to the next section to try to refine the model.***

In [None]:
! kaggle competitions submit -c nlp-getting-started -f lstm_120_token_length.csv -m "Message"