<a href="https://colab.research.google.com/github/AhsenRiaz/ML-Data/blob/main/05_text_data_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Data Preprocessing
In this example we preprocess text data so that a model can learn from it: we clean the raw text, remove uninformative words, reduce the remaining words to their stems, and convert the result to numeric form.

Libraries:

1. re: Python's built-in module for working with regular expressions. Used here for pattern matching and text cleanup.

2. NLTK: the Natural Language Toolkit, a library for working with human language. We use two parts of it:

*   stopwords: lists of very common words that carry little meaning on their own, such as he, she, it, and they.
*   PorterStemmer: a stemming algorithm. Stemming reduces a word to its base or root form, for example enjoying and enjoyed both become enjoy.

3. scikit-learn: a machine learning library. We use one class from it:

*   TfidfVectorizer: converts text to numeric TF-IDF features (more in the feature_extraction example).
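
To make the idea of stemming concrete, here is a minimal sketch that runs NLTK's `PorterStemmer` on a few words. The word list is chosen only for illustration, and the behaviour described assumes NLTK's default stemmer mode.

```python
from nltk.stem.porter import PorterStemmer

# The stemmer needs no downloaded corpus, unlike the stopword list.
stemmer = PorterStemmer()

# Illustrative words; two of them also appear (stemmed) in the outputs further down.
sample_words = ["enjoying", "enjoyed", "hillary", "campus"]

# Map each word to its stem so the results are easy to inspect.
stems = {word: stemmer.stem(word) for word in sample_words}

for word, stem in stems.items():
    print(word, "->", stem)
```

Note that a stem is not always a dictionary word: "hillary" becomes "hillari" and "campus" becomes "campu". That is fine for this workflow, because the stems only need to be consistent, not readable.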







In [78]:
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

In [79]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [80]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [81]:
news_dataset = pd.read_csv('train.csv', engine='python', on_bad_lines='warn')

news_dataset.head()

Skipping line 1977: unexpected end of data


Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [82]:
news_dataset.shape

(1975, 5)

In [83]:
news_dataset.isnull().sum()

id          0
title      51
author    209
text        4
label       0
dtype: int64

In [84]:
# replacing the null values with empty string
news_dataset = news_dataset.fillna('')

In [85]:
# news_dataset.isnull().sum()

news_dataset['text']

0       House Dem Aide: We Didn’t Even See Comey’s Let...
1       Ever get the feeling your life circles the rou...
2       Why the Truth Might Get You Fired October 29, ...
3       Videos 15 Civilians Killed In Single US Airstr...
4       Print \nAn Iranian woman has been sentenced to...
                              ...                        
1970    Country: China North Korea’s announcements of ...
1971    The United States Open added a new $150 millio...
1972    SYDNEY, Australia  —   That map of Australia y...
1973    Next Swipe left/right 7 Tory horror film poste...
1974    Same people all the time , i dont know how you...
Name: text, Length: 1975, dtype: object

In [86]:
# merge author and title into the text column (this overwrites the article body)
news_dataset['text'] = news_dataset['author'] + ' ' + news_dataset['title']
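
The null values were replaced with empty strings first because string concatenation propagates missing values. The sketch below shows this on a tiny made-up frame; the data and the `content` column name are hypothetical, chosen to show how the merged string could be stored without overwriting an existing column.

```python
import pandas as pd

# Made-up rows: the second one has no author.
df = pd.DataFrame({
    "author": ["Jane Doe", None],
    "title": ["Some headline", "Another headline"],
})

# Without filling, a missing author makes the whole merged value missing.
raw = df["author"] + " " + df["title"]

# After filling with empty strings, the title survives on its own.
filled = df.fillna("")
df["content"] = filled["author"] + " " + filled["title"]

print(raw)
print(df["content"])
```

Writing to a separate column is only one option. The notebook instead reuses `text`, which means the original article body is no longer available afterwards.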

In [87]:
news_dataset['text']

0       Darrell Lucus House Dem Aide: We Didn’t Even S...
1       Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo...
2       Consortiumnews.com Why the Truth Might Get You...
3       Jessica Purkiss 15 Civilians Killed In Single ...
4       Howard Portnoy Iranian woman jailed for fictio...
                              ...                        
1970    Author USA – China: Who Is Responsible for Nor...
1971    David Waldstein and Ben Rothenberg U.S. Open Q...
1972    Michelle Innis Australia Is Not as Down Under ...
1973    Poke Staff 7 Tory horror film posters to send ...
1974                                           Anonymous 
Name: text, Length: 1975, dtype: object

In [88]:
X = news_dataset.drop(columns='label')
Y = news_dataset['label']

In [89]:
X.shape

(1975, 4)

In [90]:
Y.shape

(1975,)

In [91]:
port_stem = PorterStemmer()

In [92]:
# stemming function
# build the stopword set once; set membership checks are fast
stop_words = set(stopwords.words('english'))

def stemming(content):
  # keep letters only, lowercase, and split into words
  words = re.sub('[^a-zA-Z]', ' ', str(content)).lower().split()
  # drop stopwords and stem the remaining words
  stemmed_words = [port_stem.stem(word) for word in words
                   if word not in stop_words]
  return ' '.join(stemmed_words)
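
The cleaning steps that run before stemming can be traced on a single title. This is a minimal sketch using only the standard library: `SAMPLE_STOPWORDS` is a hypothetical stand-in for NLTK's much longer English list, and the stemming step is left out.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (a few entries only).
SAMPLE_STOPWORDS = {"we", "didn", "t", "s"}

def clean_tokens(text):
    # Replace every character that is not an ASCII letter with a space.
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    # Lowercase and split on whitespace.
    tokens = letters_only.lower().split()
    # Keep only tokens that are not stopwords.
    return [tok for tok in tokens if tok not in SAMPLE_STOPWORDS]

title = "House Dem Aide: We Didn’t Even See Comey’s Letter"
print(clean_tokens(title))
# → ['house', 'dem', 'aide', 'even', 'see', 'comey', 'letter']
```

The apostrophes are replaced by spaces, so "Didn’t" splits into "didn" and "t" and "Comey’s" into "comey" and "s". The leftover fragments disappear only because they are in the stopword set.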

In [95]:
# apply stemming to the text feature
news_dataset['text'] = news_dataset['text'].apply(stemming)

In [94]:
news_dataset['text']

0       darrel lucu hous dem aid even see comey letter...
1       daniel j flynn flynn hillari clinton big woman...
2                  consortiumnew com truth might get fire
3       jessica purkiss civilian kill singl us airstri...
4       howard portnoy iranian woman jail fiction unpu...
                              ...                        
1970    author usa china respons north korean nuclear ...
1971    david waldstein ben rothenberg u open quiet ca...
1972    michel inni australia everyon think new york time
1973    poke staff tori horror film poster send chill ...
1974                                               anonym
Name: text, Length: 1975, dtype: object

In [105]:
# separating the data and the label
X = news_dataset['text'].values
Y = news_dataset['label'].values

In [97]:
print(X)

['darrel lucu hou dem aid even see comey letter jason chaffetz tweet'
 'daniel j flynn flynn hillari clinton big woman campu breitbart'
 'consortiumnew com truth might get fire' ...
 'michel inni australia everyon think new york time'
 'poke staff tori horror film poster send chill spine halloween' 'anonym']


In [98]:
print(Y)

[1 0 1 ... 0 1 1]


In [100]:
# Performing TF-IDF
vectorizer =  TfidfVectorizer()

In [101]:
X_vectorized = vectorizer.fit_transform(X)
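
To see where the TF-IDF weights come from, they can be computed by hand for a small corpus. The sketch below uses only the standard library and assumes scikit-learn's default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The three documents and all names are made up for illustration.

```python
import math
from collections import Counter

# Made-up, already preprocessed documents.
docs = ["truth might get fire", "truth fire alarm", "new york time"]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    counts = Counter(tokens)
    # Term count times smoothed inverse document frequency.
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    # L2-normalize so every document vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]

for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

Terms that occur in fewer documents get larger weights within a row: in the first document "might" outweighs "truth". By default `TfidfVectorizer` also ignores single-character tokens, which this sketch does not model.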

In [103]:
# TF-IDF weight of each term, printed as (document index, vocabulary index) weight
print(X_vectorized)

  (0, 5078)	0.2988906032626825
  (0, 818)	0.3552572454411006
  (0, 2539)	0.24736668016881966
  (0, 2827)	0.2988906032626825
  (0, 975)	0.2563956294363691
  (0, 4342)	0.26169514290578866
  (0, 1642)	0.2516550530981723
  (0, 96)	0.28284242060994297
  (0, 1272)	0.27463906992696424
  (0, 2315)	0.23193627189537763
  (0, 2918)	0.3370138049505055
  (0, 1205)	0.3370138049505055
  (1, 635)	0.15871756084162852
  (1, 736)	0.3880779576281725
  (1, 5384)	0.308265487806722
  (1, 501)	0.3013470519256807
  (1, 922)	0.20436653604578373
  (1, 2262)	0.19695229440722908
  (1, 1835)	0.6883566973493513
  (1, 1197)	0.28978569304450547
  (2, 1795)	0.35635819757989223
  (2, 1995)	0.35154712234393204
  (2, 3131)	0.48492083192704566
  (2, 5061)	0.4232302699121657
  (2, 970)	0.31623277493920554
  :	:
  (1971, 725)	0.2319547590222525
  (1971, 465)	0.2398693899305878
  (1971, 3459)	0.2416342218844996
  (1971, 4950)	0.09781243140502571
  (1971, 5451)	0.09862991369841745
  (1971, 3331)	0.09544548619542835
  (1972, 24