In this notebook:

- Text preprocessing

- Create the document-term matrix and TF-IDF

- Apply spaCy

### Setup

Data manipulation libraries:

In [1]:
import pandas as pd
import numpy  as np

Visualization libraries:

In [2]:
import matplotlib.pyplot  as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

Pre-processing libraries:

In [3]:
import string as st
import re
import nltk

In [4]:
from sklearn.feature_extraction.text  import CountVectorizer , TfidfVectorizer
from nltk.tokenize                    import word_tokenize
from nltk                             import WordNetLemmatizer

Class instances:

In [5]:
lemmatizer = WordNetLemmatizer()
cv         = CountVectorizer(stop_words='english', max_df=5)
cv_tfidf   = TfidfVectorizer(stop_words='english', max_df=5)
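
Note that an integer `max_df` is an absolute document count, not a proportion: with `max_df=5`, any term that appears in more than 5 documents is dropped from the vocabulary. The sketch below reproduces that filter in plain Python on a made-up toy corpus. The documents and the helper name `filter_by_max_df` are illustrative assumptions and are not part of this notebook or of scikit-learn.

```python
from collections import Counter

def filter_by_max_df(docs, max_df):
    """Keep terms whose document frequency is at most max_df (an absolute count)."""
    doc_freq = Counter()
    for doc in docs:
        # count each term once per document, however often it repeats inside it
        doc_freq.update(set(doc.split()))
    return sorted(term for term, df in doc_freq.items() if df <= max_df)

# Made-up documents for illustration only
toy_docs = [
    "trump wins vote",
    "trump loses vote",
    "trump tweets again",
]

# "trump" occurs in 3 documents, so max_df=2 removes it
vocab = filter_by_max_df(toy_docs, max_df=2)
print(vocab)
```

Dropping every term shared by more than five documents removes most of the overlap between documents, which is consistent with the near-zero off-diagonal similarities shown later in the notebook.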

### Data-set

In [6]:
file_path = 'English_Global_news.csv'

English_Global_news = pd.read_csv(file_path , index_col=0)

### Data Overview

In [7]:
English_Global_news.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1263809 entries, 0 to 3327276
Data columns (total 2 columns):
 #   Column    Non-Null Count    Dtype 
---  ------    --------------    ----- 
 0   Text      1263809 non-null  object
 1   language  1263809 non-null  object
dtypes: object(2)
memory usage: 28.9+ MB


In [8]:
English_Global_news.head(2)

Unnamed: 0,Text,language
0,Here Are the Details on Facebook's Global Part...,English
3,Petrol & diesel on the rise post daily price r...,English


In [9]:
English_Global_news.sample(2)

Unnamed: 0,Text,language
450228,Fin24.com - Gigaba and CEOs air concerns over ...,English
152785,Sarah Harding takes swipe at 'savvy' Cheryl ov...,English


In [10]:
English_Global_news.tail(2)

Unnamed: 0,Text,language
3327272,armed forces may be organized as standing forc...,English
3327276,the total high school population was now appro...,English


In [11]:
print('Data has {} rows and {} columns'.format(English_Global_news.shape[0], English_Global_news.shape[1]))

Data has 1263809 rows and 2 columns


### Checking for NaN

In [12]:
English_Global_news.isnull().values.any()

False

### Dropping duplicates

In [13]:
English_Global_news = English_Global_news.drop_duplicates()

In [14]:
print('After removing (Nans - duplicates) the data has {} rows and {} columns'
      .format(English_Global_news.shape[0], English_Global_news.shape[1]))

After removing (Nans - duplicates) the data has 867786 rows and 2 columns


### Remove the unneeded column

In [15]:
English_Global_news = English_Global_news.drop('language', axis=1, errors='ignore')

In [16]:
English_Global_news.shape

(867786, 1)

### Taking a sample

In [17]:
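# .copy() makes the sample an independent DataFrame, so adding columns later
# does not trigger pandas' SettingWithCopyWarning
English_Global_news_sample = English_Global_news[:15000].copy()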

In [18]:
English_Global_news_sample.shape

(15000, 1)

### Text Preprocessing

In [19]:
# Build the stop-word set once; calling stopwords.words() for every token is very slow.
# Requires the NLTK stop-word corpus: nltk.download('stopwords')
STOP_WORDS = set(nltk.corpus.stopwords.words('english'))

def preprocessing_pipeline(text):
    # remove URLs
    text = re.sub(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', ' ', text)
    # remove punctuation
    text = "".join(ch for ch in text if ch not in st.punctuation)
    # replace every non-alphabetic character (digits included) with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # lowercase
    text = text.lower()
    # split on whitespace into tokens
    tokens = text.split()
    # remove stop words using the NLTK English stop-word list
    tokens = [word for word in tokens if word not in STOP_WORDS]
    # join the remaining tokens back into a single string
    return " ".join(tokens)

In [20]:
English_Global_news_sample['Text'] = English_Global_news_sample['Text'].apply(preprocessing_pipeline)
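
To make the effect of the cleaning steps concrete, here is a minimal, self-contained sketch that runs the same sequence (URL removal, punctuation removal, letters only, lowercasing, stop-word filtering) on a single string. The function name `clean_text`, the simplified URL pattern, the tiny `TOY_STOP_WORDS` set, and the headline (made up, loosely modeled on the first row of the data) are all assumptions for illustration; the notebook itself uses the NLTK stop-word list.

```python
import re
import string

# Tiny stand-in for the NLTK English stop-word list (assumption, illustration only)
TOY_STOP_WORDS = {"here", "are", "the", "on", "a", "an", "of"}

def clean_text(text, stop_words=TOY_STOP_WORDS):
    # drop URLs (simplified pattern, not the one used in the notebook)
    text = re.sub(r"https?://\S+", " ", text)
    # drop punctuation, so "Facebook's" becomes "Facebooks"
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # keep letters only (digits become spaces), then lowercase
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    # split on whitespace and filter out stop words
    return " ".join(word for word in text.split() if word not in stop_words)

# Made-up headline for illustration
headline = "Here Are the Details on Facebook's Global Partner Summit 2017: https://example.com/x"
cleaned = clean_text(headline)
print(cleaned)
```

Because punctuation is deleted rather than replaced by a space, contractions and possessives are fused into one token, which is why tokens such as `facebooks` appear in the processed column.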


### Apply Lemmatization

- Lemmatization: reduce each word to its base (dictionary) form using vocabulary and morphological analysis.

In [21]:
def apply_lemmatize(text):
    # lemmatize each token first as a verb, then as an adjective
    lemmas = []
    for word in text.split():
        word = lemmatizer.lemmatize(word, pos='v')
        word = lemmatizer.lemmatize(word, pos='a')
        lemmas.append(word)
    return ' '.join(lemmas)

In [22]:
English_Global_news_sample['Text_lemma'] = English_Global_news_sample['Text'].apply(apply_lemmatize)



In [23]:
English_Global_news_sample.sample(5)

Unnamed: 0,Text,Text_lemma
23865,mcgregor promises break old man final press co...,mcgregor promise break old man final press con...
6779,six democrats could beat trump including one t...,six democrats could beat trump include one tex...
19445,sirona biochem receives tsx venture exchange a...,sirona biochem receive tsx venture exchange ap...
32115,lottery hidden secret billions unclaimed prize...,lottery hide secret billions unclaimed prize p...
24287,bomber oozing confidence,bomber ooze confidence


### Document Similarity

- Document-Term Matrix

In [24]:
X = English_Global_news_sample.Text_lemma
X_cv = cv.fit_transform(X)

In [25]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
Document_TM = pd.DataFrame(X_cv.toarray(), columns=cv.get_feature_names_out())

In [26]:
Document_TM.head()

Unnamed: 0,aa,aaa,aaco,aadhaar,aafias,aamc,aampw,aap,aapl,aar,...,zosano,zska,zucchini,zuckerberg,zumas,zumbrunwall,zwave,zweig,zyme,zzyzx
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
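
In a document-term matrix each row is a document, each column is a vocabulary term, and each cell holds how often that term occurs in that document. Most cells are zero because any one document uses only a small part of the vocabulary. A minimal plain-Python sketch on made-up documents (the toy corpus is an assumption for illustration; the notebook builds the real matrix with `CountVectorizer`):

```python
from collections import Counter

# Made-up, already preprocessed documents
toy_docs = [
    "petrol diesel price rise",
    "diesel price fall",
    "google chrome enterprise",
]

# Vocabulary: every distinct term, sorted alphabetically
vocab = sorted({term for doc in toy_docs for term in doc.split()})

# One row per document, one column per vocabulary term, cell = term count
matrix = []
for doc in toy_docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```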


#### Check cosine similarity

In [27]:
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

In [28]:
cosine_similarity(Document_TM)

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [29]:
pairwise_distances(Document_TM, metric='cosine')

array([[0.00000000e+00, 1.00000000e+00, 1.00000000e+00, ...,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00],
       [1.00000000e+00, 0.00000000e+00, 1.00000000e+00, ...,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00],
       [1.00000000e+00, 1.00000000e+00, 0.00000000e+00, ...,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00],
       ...,
       [1.00000000e+00, 1.00000000e+00, 1.00000000e+00, ...,
        0.00000000e+00, 1.00000000e+00, 1.00000000e+00],
       [1.00000000e+00, 1.00000000e+00, 1.00000000e+00, ...,
        1.00000000e+00, 2.22044605e-16, 1.00000000e+00],
       [1.00000000e+00, 1.00000000e+00, 1.00000000e+00, ...,
        1.00000000e+00, 1.00000000e+00, 2.22044605e-16]])
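
The two outputs above are two views of the same quantity: cosine distance is one minus cosine similarity. Diagonal entries such as `2.22044605e-16` are floating-point rounding error and are effectively zero. A minimal sketch of the computation for single vectors in plain Python (the helper name `cosine_similarity_vec` and the toy count vectors are assumptions for illustration; returning 0 for an all-zero vector is a choice made for this sketch):

```python
import math

def cosine_similarity_vec(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # an empty document shares nothing with any other document
        return 0.0
    return dot / (norm_a * norm_b)

# Toy term-count rows over an 8-term vocabulary
doc_a = [0, 1, 0, 0, 0, 1, 1, 1]
doc_b = [0, 1, 0, 1, 0, 0, 1, 0]
doc_c = [1, 0, 1, 0, 1, 0, 0, 0]

sim_ab = cosine_similarity_vec(doc_a, doc_b)  # two shared terms
sim_ac = cosine_similarity_vec(doc_a, doc_c)  # no shared terms
dist_ab = 1.0 - sim_ab                        # cosine distance

print(sim_ab, sim_ac, dist_ab)
```

Two documents with no term in common always have similarity 0 and distance 1, whatever their length.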

- TF-IDF

In [30]:
X_tfidf  = cv_tfidf.fit_transform(X).toarray()

In [31]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
TF_IDF   = pd.DataFrame(X_tfidf, columns=cv_tfidf.get_feature_names_out())

In [32]:
TF_IDF

Unnamed: 0,aa,aaa,aaco,aadhaar,aafias,aamc,aampw,aap,aapl,aar,...,zosano,zska,zucchini,zuckerberg,zumas,zumbrunwall,zwave,zweig,zyme,zzyzx
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
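
For reference, the weights in this table come from multiplying each term count by an inverse document frequency and then scaling every row to unit length. The sketch below follows the formula `TfidfVectorizer` uses with its default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), that is, idf = ln((1 + n) / (1 + df)) + 1. It is a plain-Python illustration on made-up documents, not the library's implementation, and the helper name `tfidf_rows` is an assumption.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed idf and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[t] * idf[t] for t in vocab]
        # scale the row to unit Euclidean length; leave all-zero rows untouched
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Made-up documents for illustration
vocab, rows = tfidf_rows(["diesel price rise", "diesel price fall"])
for row in rows:
    print([round(v, 3) for v in row])
```

Terms that occur in only one document ("rise", "fall") receive a higher weight than terms shared by both ("diesel", "price").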


#### Check cosine similarity

In [33]:
cosine_similarity(TF_IDF)

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [34]:
pairwise_distances(TF_IDF, metric='cosine')

array([[0., 1., 1., ..., 1., 1., 1.],
       [1., 0., 1., ..., 1., 1., 1.],
       [1., 1., 0., ..., 1., 1., 1.],
       ...,
       [1., 1., 1., ..., 0., 1., 1.],
       [1., 1., 1., ..., 1., 0., 1.],
       [1., 1., 1., ..., 1., 1., 0.]])

### Apply spaCy

In [35]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [36]:
English_Global_news_sample['spacy_doc'] = English_Global_news_sample['Text'].apply(nlp)



In [37]:
English_Global_news_sample.head()

Unnamed: 0,Text,Text_lemma,spacy_doc
0,details facebooks global partner summit,detail facebooks global partner summit,"(details, facebooks, global, partner, summit)"
3,petrol diesel rise post daily price revisions ...,petrol diesel rise post daily price revisions ...,"(petrol, diesel, rise, post, daily, price, rev..."
4,could deshone kizer end browns history qb misf...,could deshone kizer end brown history qb misfo...,"(could, deshone, kizer, end, browns, history, ..."
5,comment microsoft never sneakily force windows...,comment microsoft never sneakily force windows...,"(comment, microsoft, never, sneakily, force, w..."
6,comment google chrome enterprise techfan,comment google chrome enterprise techfan,"(comment, google, chrome, enterprise, techfan)"


In [38]:
tokens = []
lemma = []

for doc in nlp.pipe(English_Global_news_sample['Text'].astype(str).values, batch_size=50):
    # Doc.is_parsed is deprecated in spaCy v3; has_annotation("DEP") is its replacement
    if doc.has_annotation("DEP"):
        tokens.append([n.text for n in doc])
        lemma.append([n.lemma_ for n in doc])
    else:
        # Keep both lists aligned with the DataFrame rows:
        # add a placeholder when a document has no parse
        tokens.append(None)
        lemma.append(None)

English_Global_news_sample['species_tokens'] = tokens
English_Global_news_sample['species_lemma']  = lemma



In [39]:
English_Global_news_sample.sample(2)

Unnamed: 0,Text,Text_lemma,spacy_doc,species_tokens,species_lemma
353,wdrb heine brothers raise kentucky science cen...,wdrb heine brothers raise kentucky science cen...,"(wdrb, heine, brothers, raise, kentucky, scien...","[wdrb, heine, brothers, raise, kentucky, scien...","[wdrb, heine, brother, raise, kentucky, scienc..."
21624,honour uni getting students help,honour uni get students help,"(honour, uni, getting, students, help)","[honour, uni, getting, students, help]","[honour, uni, get, student, help]"


In [40]:
def get_text(text):
    # replace every non-letter (brackets, quotes, commas) with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    return text

In [41]:
English_Global_news_sample['species_lemma'] = English_Global_news_sample['species_lemma'].astype(str).apply(get_text)

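
The cell above turns each lemma list into text by stringifying the list and blanking out every non-letter, which leaves runs of spaces where the brackets, quotes, and commas were. The sketch below compares that with joining the tokens directly, using one lemma list taken from the sample output above. Joining is offered only as a possible alternative; it would need a guard for rows where the list is `None`.

```python
import re

# Lemma list as produced by spaCy for one of the sample rows
lemmas = ["honour", "uni", "get", "student", "help"]

# Approach used in the notebook: str(list), then replace non-letters with spaces
via_regex = re.sub(r"[^a-zA-Z]", " ", str(lemmas))

# Alternative: join the tokens with single spaces
via_join = " ".join(lemmas)

print(repr(via_regex))
print(repr(via_join))
```

Both strings contain the same tokens, so whitespace-based tokenizers treat them identically; the joined form is simply cleaner to read and store.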


In [42]:
English_Global_news_sample.sample(5)

Unnamed: 0,Text,Text_lemma,spacy_doc,species_tokens,species_lemma
23868,settlement means cash calls promising free cru...,settlement mean cash call promise free cruise,"(settlement, means, cash, calls, promising, fr...","[settlement, means, cash, calls, promising, fr...",settlement mean cash call promis...
27750,americans wish luck million powerball lottery ...,americans wish luck million powerball lottery ...,"(americans, wish, luck, million, powerball, lo...","[americans, wish, luck, million, powerball, lo...",americans wish luck million powe...
34886,josh harrison spoils rich hills nohit bid thin...,josh harrison spoil rich hill nohit bid thin w...,"(josh, harrison, spoils, rich, hills, nohit, b...","[josh, harrison, spoils, rich, hills, nohit, b...",josh harrison spoil rich hill ...
24980,grenfell pm saves face admitting council flaws,grenfell pm save face admit council flaw,"(grenfell, pm, saves, face, admitting, council...","[grenfell, pm, saves, face, admitting, council...",grenfell pm save face admit c...
13128,deciphering sweet satisfaction corn,decipher sweet satisfaction corn,"(deciphering, sweet, satisfaction, corn)","[deciphering, sweet, satisfaction, corn]",decipher sweet satisfaction corn


### Store the DataFrames as CSV

- Document_TM

In [43]:
Document_TM.to_csv('Document_TM.csv')

- TF_IDF

In [44]:
TF_IDF.to_csv('TF_IDF.csv')

- English_Global_news_sample

In [45]:
English_Global_news_sample.to_csv('Sample.csv')