**Classification de tweets (textes) en cas de catastrophe**

Nous allons à travers ce projet classifier des textes (tweets) provenant de Twitter pour déterminer s'il y a catastrophe ou pas. 
L'ensemble de données utilisé contient:

train.csv qui a des instances (textes de Twitter) 7613 lignes et atrributs.

id: un identifiant unique de l'utilisateur
mot-clé: un mot particulier du tweet
location: l'emplacement à partir duquel le tweet a été envoyé
texte: le texte du tweet
cible: cela indique si un tweet concerne une véritable catastrophe (1) ou non (0)

test.csv qui a des instances (textes de Twitter) 3263 lignes et attributs.

id: un identifiant unique de l'utilisateur
mot-clé: un mot particulier du tweet
location: l'emplacement à partir duquel le tweet a été envoyé
texte: le texte du tweet
submit.csv: Cela collectera l'id et la cible qui indiquent si le message à l'id est un message lié à une catastrophe ou non.

In [70]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import nltk
nltk.download('stopwords')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/content/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
/content/train.csv
/content/submission.csv
/content/sample_submission.csv
/content/test.csv
/content/.config/.last_opt_in_prompt.yaml
/content/.config/active_config
/content/.config/.last_update_check.json
/content/.config/.last_survey_prompt.yaml
/content/.config/config_sentinel
/content/.config/gce
/content/.config/.metricsUUID
/content/.config/configurations/config_default
/content/.config/logs/2020.11.13/17.32.45.071309.log
/content/.config/logs/2020.11.13/17.33.29.478721.log
/content/.config/logs/2020.11.13/17.33.44.836274.log
/content/.config/logs/2020.11.13/17.33.22.211003.log
/content/.config/logs/2020.11.13/17.33.45.553060.log
/content/.config/logs/2020.11.13/17.33.07.342211.log
/content/sample_data/anscombe.json
/content/sample_data/README.md
/content/sample_data/california_housing_train.csv
/content/sample_data/mnist_test.csv
/content/sample_data/california_

In [71]:
train = pd.read_csv('../content/train.csv')
test = pd.read_csv('../content/test.csv')
train.head(10)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
5,8,,,#RockyFire Update => California Hwy. 20 closed...,1
6,10,,,#flood #disaster Heavy rain causes flash flood...,1
7,13,,,I'm on top of the hill and I can see a fire in...,1
8,14,,,There's an emergency evacuation happening now ...,1
9,15,,,I'm afraid that the tornado is coming to our a...,1


In [72]:
train.count()

id          7613
keyword     7552
location    5080
text        7613
target      7613
dtype: int64

In [73]:
test.count()

id          3263
keyword     3237
location    2158
text        3263
dtype: int64

In [74]:
from nltk.corpus import stopwords
import re
import string

In [75]:
def change_contraction_verb(text):
    text = re.sub(r"n\'t", " not", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'s", " is", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'t", " not", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'m", " am", text)
    
    # specific
    text = re.sub(r"won\'t", "will not", text)
    text = re.sub(r"can\'t", "can not", text)
    return text

train['text'] = train['text'].apply(lambda x : change_contraction_verb(x))
test['text'] = test['text'].apply(lambda x : change_contraction_verb(x))

train['text'].head(10)

0    Our Deeds are the Reason of this #earthquake M...
1               Forest fire near La Ronge Sask. Canada
2    All residents asked to  ishelter in place' are...
3    13,000 people receive #wildfires evacuation or...
4    Just got sent this photo from Ruby #Alaska as ...
5    #RockyFire Update => California Hwy. 20 closed...
6    #flood #disaster Heavy rain causes flash flood...
7    I am on top of the hill and I can see a fire i...
8    There is an emergency evacuation happening now...
9    I am afraid that the tornado is coming to our ...
Name: text, dtype: object

In [76]:
def custom_preprocessor(text):
    '''
    Make text lowercase, remove text in square brackets,remove links,remove special characters
    and remove words containing numbers.
    '''
    text = text.lower()
    text = re.sub('\[.*?\]', '', text)
    text = re.sub("\\W"," ",text) # remove special chars
    text = re.sub('https?://\S+|www\.\S+', '', text)
    text = re.sub('<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('\n', '', text)
    text = re.sub('\w*\d\w*', '', text)
    return text

train['text'] = train['text'].apply(lambda x : custom_preprocessor(x))
test['text'] = test['text'].apply(lambda x : custom_preprocessor(x))

train['text'].head(10)

0    our deeds are the reason of this  earthquake m...
1               forest fire near la ronge sask  canada
2    all residents asked to  ishelter in place  are...
3      people receive  wildfires evacuation orders ...
4    just got sent this photo from ruby  alaska as ...
5     rockyfire update    california hwy   closed i...
6     flood  disaster heavy rain causes flash flood...
7    i am on top of the hill and i can see a fire i...
8    there is an emergency evacuation happening now...
9    i am afraid that the tornado is coming to our ...
Name: text, dtype: object

In [77]:
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
train['text'] = train['text'].apply(lambda x : remove_emoji(x))
test['text'] = test['text'].apply(lambda x : remove_emoji(x))

In [78]:
train.head(10)

Unnamed: 0,id,keyword,location,text,target
0,1,,,our deeds are the reason of this earthquake m...,1
1,4,,,forest fire near la ronge sask canada,1
2,5,,,all residents asked to ishelter in place are...,1
3,6,,,people receive wildfires evacuation orders ...,1
4,7,,,just got sent this photo from ruby alaska as ...,1
5,8,,,rockyfire update california hwy closed i...,1
6,10,,,flood disaster heavy rain causes flash flood...,1
7,13,,,i am on top of the hill and i can see a fire i...,1
8,14,,,there is an emergency evacuation happening now...,1
9,15,,,i am afraid that the tornado is coming to our ...,1


In [79]:
test.head(10)

Unnamed: 0,id,keyword,location,text
0,0,,,just happened a terrible car crash
1,2,,,heard about earthquake is different cities s...
2,3,,,there is a forest fire at spot pond geese are...
3,9,,,apocalypse lighting spokane wildfires
4,11,,,typhoon soudelor kills in china and taiwan
5,12,,,we are shaking it is an earthquake
6,21,,,they would probably still show more life than ...
7,22,,,hey how are you
8,27,,,what a nice hat
9,29,,,fuck off


In [80]:
#Stopwords

from sklearn.feature_extraction.text import CountVectorizer

stopwords = stopwords.words('english')

print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [81]:
#CountVectorizer

count_vectorizer = CountVectorizer(token_pattern=r'\w{1,}', ngram_range=(1, 2), stop_words = stopwords)

train_vector = count_vectorizer.fit_transform(train['text'])
test_vector = count_vectorizer.transform(test['text'])

In [82]:
train_vector.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [83]:
from sklearn.linear_model import LogisticRegression
from sklearn import model_selection

clf = LogisticRegression()

#Check accuracy score from K-Fold cross validation
scores = model_selection.cross_val_score(clf, train_vector, train["target"], cv=5, scoring="accuracy")
print(scores)

[0.72619829 0.65068943 0.70584373 0.68593955 0.76938239]


In [84]:
#get the mean of each fold 
print("Accuracy of Model with Cross Validation is: ",scores.mean() * 100)

Accuracy of Model with Cross Validation is:  70.76106791785699


In [85]:
# Fitting a simple Logistic Regression on Counts
clf.fit(train_vector, train["target"])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [86]:
# Submission
submission = pd.read_csv("../content/sample_submission.csv")
submission["target"] = clf.predict(test_vector)
submission.to_csv("submission.csv", index=False)

In [87]:
submission.head(20)

Unnamed: 0,id,target
0,0,1
1,2,1
2,3,1
3,9,0
4,11,1
5,12,1
6,21,0
7,22,0
8,27,0
9,29,0
