<a href="https://colab.research.google.com/github/Princess-Mcdonald/Ai-School-Team3-NLP-Project/blob/main/Princess_Ai_School_Team3_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Notebook Imports**

In [26]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from wordcloud import WordCloud as WC
from PIL import Image
import re

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

### This is the pre-processing stage.
 It involves:
 <br>
 1) Converting the text to lower case.
 <br>
 2) Tokenization: splitting sentences into individual words.
 <br>
 3) Removing stop words.
 <br>
 4) Stripping out HTML tags.
 <br>
 5) Word stemming: reducing words to their base or root form.
 <br>
 6) Removing punctuation.
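
The six steps above can be sketched as one small function. This is a minimal illustration using only the standard library, not the notebook's exact pipeline: `preprocess`, `SAMPLE_STOP_WORDS`, and the whitespace tokenizer are assumptions made for the example, and stemming is left as an optional callable so that any stemmer can be plugged in.

```python
import html
import re
import string

# Illustrative subset only; NLTK's English stop-word list is much larger.
SAMPLE_STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "this", "to"}

TAG_RE = re.compile(r"<[^>]+>")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text, stop_words=SAMPLE_STOP_WORDS, stem=None):
    """Apply the six pre-processing steps and return a list of cleaned tokens."""
    text = text.lower()                                   # 1) lower case
    text = TAG_RE.sub(" ", text)                          # 4) strip HTML tags
    text = html.unescape(text)                            #    decode entities such as &amp;
    text = text.translate(PUNCT_TABLE)                    # 6) remove punctuation
    tokens = text.split()                                 # 2) simple whitespace tokenization
    tokens = [t for t in tokens if t not in stop_words]   # 3) remove stop words
    if stem is not None:                                  # 5) optional stemming
        tokens = [stem(t) for t in tokens]
    return tokens


print(preprocess("<b>The Fire</b> is spreading, fast!"))
```

Passing `stem=SnowballStemmer("english").stem` would apply the stemmer imported in this notebook. Note that removing punctuation also removes `#`, so hashtags should be extracted before this step if they are needed later.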

In [27]:
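def hashtags(text):
    # Collect every "#tag" token and join them into one comma-separated string.
    re_hash = re.compile(r"#\S+")
    result = re_hash.findall(text)
    if result:
        return ", ".join(result)
    # No hashtags found.
    return None

def remove_url(text):
    # "https?" matches both http:// and https:// links.
    re_url = re.compile(r"https?://\S+|www\.\S+")
    return re_url.sub("", text)

def stemmer(text):
    # word_tokenize needs the NLTK tokenizer data to be downloaded first.
    snowball = SnowballStemmer("english")

    wordlist = word_tokenize(text)

    words = [snowball.stem(word) for word in wordlist]

    return " ".join(words)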
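
A quick way to sanity-check the URL and hashtag patterns is to run them on a sample string. The sketch below uses a made-up tweet (not taken from the data set) and compares `https?://` (optional `s`) with `https+://` (one or more `s`), which does not match a plain `http://` link. The variable names are chosen for illustration only.

```python
import re

# Made-up sample text for illustration.
sample = "Flooding downtown http://t.co/abc and https://t.co/xyz #flood #rescue"

plus_pattern = re.compile(r"https+://\S+|www\.\S+")      # 'http' followed by one or more 's'
optional_pattern = re.compile(r"https?://\S+|www\.\S+")  # 'http' followed by an optional 's'

plus_hits = plus_pattern.findall(sample)
optional_hits = optional_pattern.findall(sample)
tags = re.findall(r"#\S+", sample)

print(plus_hits)       # only the https:// link
print(optional_hits)   # both links
print(tags)
```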

In [28]:
df = pd.read_csv("/content/AI-SCHOOL-GRUOP-3-PROJECT1-CODE-ON-TITANIC-PASSENGERS-SURVIVAL-PREDICTION-main/nlp-twitter/train.csv")

## **Checking the data**
--- We inspect the first few rows to see what kind of data we are working with.


In [29]:
df.head()

|   | id | keyword | location | text | target |
|---|----|---------|----------|------|--------|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |


### **Checking some information on the data frame.**


In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


### **Checking for missing data and counting the missing values per column.**

In [31]:
df.isnull().sum()

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

In [32]:
df["keyword"] = df["keyword"].str.replace("%20", " ")
df["keyword"]

0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
       ... 
7608    NaN
7609    NaN
7610    NaN
7611    NaN
7612    NaN
Name: keyword, Length: 7613, dtype: object

In [33]:
mode_keyword = df['keyword'].mode().values[0]
mode_keyword

'fatalities'

In [34]:
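# Assign the result back; calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas versions and may not update df.
df["keyword"] = df["keyword"].fillna(mode_keyword)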

## **Feature Selection**
--- This involves dropping features that are not needed, so the model is not trained on uninformative columns. Here we drop `location`, which is missing in 2533 of the 7613 rows.

In [35]:
df.drop("location", axis=1, inplace=True)
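
One way to make a drop decision like this explicit is to compute the share of missing values per column and compare it with a cut-off. The sketch below is an assumption for illustration, not the rule used in this notebook: the toy frame is made up, and the 0.3 threshold is hypothetical (in the real data, `location` is missing in 2533 of 7613 rows, roughly 33%).

```python
import pandas as pd

# Toy data, made up for illustration; these are not rows from train.csv.
toy = pd.DataFrame({
    "keyword": ["fire", None, "flood", "storm"],
    "location": [None, None, "Lagos", None],
    "text": ["a", "b", "c", "d"],
})

missing_share = toy.isnull().mean()   # fraction of missing values per column
threshold = 0.3                       # hypothetical cut-off
to_drop = missing_share[missing_share > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)

print(missing_share)
print(to_drop)
```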

In [36]:
df.head()

|   | id | keyword | text | target |
|---|----|---------|------|--------|
| 0 | 1 | fatalities | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | fatalities | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | fatalities | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | fatalities | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | fatalities | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [37]:
df.set_index("id", drop=True, inplace=True)

In [38]:
df["text"] = df["text"].str.lower()

In [39]:
df["text"] = df["text"].apply(remove_url)

In [40]:
df["hashtags"] = df["text"].apply(hashtags)

In [41]:
df.head()

| id | keyword | text | target | hashtags |
|----|---------|------|--------|----------|
| 1 | fatalities | our deeds are the reason of this #earthquake m... | 1 | #earthquake |
| 4 | fatalities | forest fire near la ronge sask. canada | 1 | None |
| 5 | fatalities | all residents asked to 'shelter in place' are ... | 1 | None |
| 6 | fatalities | 13,000 people receive #wildfires evacuation or... | 1 | #wildfires |
| 7 | fatalities | just got sent this photo from ruby #alaska as ... | 1 | #alaska, #wildfires |
