The first step is to import all the necessary libraries. Since this is an NLP problem, we will mostly be dealing with strings.

In [1]:
import re
import string
import pandas as pd
import nltk

In [2]:
df = pd.read_csv("Inshorts App Data.csv")

In [3]:
df.head()

Unnamed: 0,Headline,Short,Source,Time,Publish Date,Subject
0,Was stopped from entering my own studio at Tim...,TV news anchor Arnab Goswami has said he was t...,YouTube,23:24:00,3/25/2017,Politics
1,New trailer of &#39;Justice League&#39; released,A new trailer for the upcoming superhero film ...,YouTube,21:50:00,3/25/2017,Entertainment
2,His touch was not right: Shilpa Shinde on sexu...,"Television actress Shilpa Shinde, while openin...",The Quint,21:18:00,3/25/2017,Entertainment
3,Anti-Romeo squads must not trouble consenting ...,"Uttar Pradesh Chief Minister Yogi Adityanath, ...",ANI,23:05:00,3/25/2017,Politics
4,Both Romeo and Juliet are welcome in Delhi: AA...,In an apparent jibe at UP&#39;s anti-Romeo squ...,India Today,9:26:00,3/26/2017,Politics
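
The preview above shows HTML entities such as '&#39;' inside the text. The cleaning steps used later in this notebook strip the '&', '#', ';' and the digits, which happens to work for '&#39;' but would leave residue for named entities (for example '&amp;' would turn into 'amp'). Below is a minimal sketch of decoding entities first with the standard library's html.unescape. This step is an assumption on our side and is not part of the original notebook; the 'AT&amp;T' string is a made-up sample.

```python
import html

# Sample headline copied from the preview above.
raw = "New trailer of &#39;Justice League&#39; released"

# html.unescape turns numeric and named entities back into characters.
decoded = html.unescape(raw)
print(decoded)

# A named entity; this string is a made-up sample, not from the dataset.
named = html.unescape("AT&amp;T")
print(named)
```

If used, such a step would go before the punctuation removal, for example df['Headline'].apply(html.unescape).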




The data suggests that the column 'Subject' is our target variable. We have to build a model that predicts the subject a news item belongs to from the other feature variables. At first glance, the columns 'Source', 'Time' and 'Publish Date' would do little to help us achieve that goal, so it is best to drop them and continue with our problem.






In [4]:
df.columns

Index(['Headline', 'Short', 'Source ', 'Time ', 'Publish Date', 'Subject'], dtype='object')

In [7]:
df.drop(['Source ','Time ','Publish Date'], axis =1, inplace= True)
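
Note that 'Source ' and 'Time ' carry a trailing space in the column index, which is why the drop call above has to spell them with the space. Below is a small sketch of normalizing such names up front. It is shown on a plain list so that it runs without the dataset; the pandas equivalent is given in the comments.

```python
# Column names as printed by df.columns above (note the trailing spaces).
columns = ['Headline', 'Short', 'Source ', 'Time ', 'Publish Date', 'Subject']

# Strip leading/trailing whitespace from every name.
clean_columns = [name.strip() for name in columns]
print(clean_columns)

# With a DataFrame the same idea is:
#   df.columns = df.columns.str.strip()
# after which df.drop(['Source', 'Time', 'Publish Date'], axis=1) would work.
```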

In [8]:
df.head()

Unnamed: 0,Headline,Short,Subject
0,Was stopped from entering my own studio at Tim...,TV news anchor Arnab Goswami has said he was t...,Politics
1,New trailer of &#39;Justice League&#39; released,A new trailer for the upcoming superhero film ...,Entertainment
2,His touch was not right: Shilpa Shinde on sexu...,"Television actress Shilpa Shinde, while openin...",Entertainment
3,Anti-Romeo squads must not trouble consenting ...,"Uttar Pradesh Chief Minister Yogi Adityanath, ...",Politics
4,Both Romeo and Juliet are welcome in Delhi: AA...,In an apparent jibe at UP&#39;s anti-Romeo squ...,Politics


Both 'Headline' and 'Short' have content that can help us identify the subject a news item belongs to. Our strategy is to transform these two feature variables until they are ready to be quantified and fed to a statistical approach.
The cleaning steps we are going to undertake are therefore as follows:

1.) Removing special characters such as  $,@,#,! etc.

2.) Removing punctuation such as  ,.:; etc.

3.) Removing numbers such as 123,4,5 etc.

4.) Converting strings to lowercase.

5.) Removing stopwords such as i,me,myself (in total 179 elements).

6.) Stemming or lemmatization, depending on which gives the better result.

Steps 1 to 4 can be embedded directly into the main function, as each requires a single line of code. For steps 5 and 6 we need to create separate functions, which can then be used as part of the main function.
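
To make steps 1 to 4 concrete, here is a minimal sketch that uses only the standard library. It swaps in str.translate for the punctuation removal, which is one possible way to do it and not necessarily the one used further down; the name basic_clean is hypothetical.

```python
import re
import string

def basic_clean(text):
    """Steps 1-4: drop special characters and punctuation, drop digits, lowercase."""
    # string.punctuation already contains $, @, #, ! and &,
    # so a single pass covers steps 1 and 2.
    no_punct = text.translate(str.maketrans('', '', string.punctuation))
    # Step 3: remove runs of digits.
    no_digits = re.sub(r'\d+', '', no_punct)
    # Step 4: convert to lowercase.
    return no_digits.lower()

# Sample headline copied from the df.head() output above.
sample = "New trailer of &#39;Justice League&#39; released"
print(basic_clean(sample))  # new trailer of justice league released
```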

In [9]:
from nltk.tokenize import ToktokTokenizer
tokenizer = ToktokTokenizer()


In [11]:
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [12]:
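def remove_stopwords(text):
    # Split the text into tokens and trim stray whitespace around each one.
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    # Keep only the tokens that are not in the NLTK stopword list.
    t =[token for token in tokens if token.lower() not in stopword_list]
    text = ' '.join(t)
    return text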
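
Checking membership in a list scans it entry by entry, so for 25,000+ rows it may be worth holding the stopwords in a set. Here is a minimal sketch of the same filtering idea. The tiny stopword set and the whitespace split are stand-ins for the NLTK list and ToktokTokenizer, and the name remove_stopwords_simple is hypothetical.

```python
# Tiny stand-in for the NLTK list (the real one has 179 entries).
SAMPLE_STOPWORDS = frozenset({'i', 'me', 'my', 'was', 'from', 'own', 'at'})

def remove_stopwords_simple(text, stopwords=SAMPLE_STOPWORDS):
    # Whitespace split is a simplification of ToktokTokenizer.
    kept = [tok for tok in text.split() if tok.lower() not in stopwords]
    return ' '.join(kept)

# Shortened sample sentence, not a full row from the dataset.
print(remove_stopwords_simple("was stopped from entering my own studio"))
```

In the notebook itself, the equivalent change would be stopword_set = set(stopword_list) and testing against stopword_set inside remove_stopwords.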

In [13]:
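def get_stem(text):
    # PorterStemmer is exposed under nltk.stem.
    stemmer = nltk.stem.PorterStemmer()
    # Stem every whitespace-separated word and join the words back together.
    text = ' '.join([stemmer.stem(word) for word in text.split()])
    return text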

In [14]:
from nltk.stem import WordNetLemmatizer

In [15]:
wordnet_lemmatizer = WordNetLemmatizer()

In [16]:
def lemma(message):
    # Lemmatize every whitespace-separated word (treated as a noun by default) and rejoin.
    return ' '.join(wordnet_lemmatizer.lemmatize(word) for word in message.split())

Now that we are done with the basic functions, it's time to write an overarching function that gives us the required output. To decide between stemming and lemmatization, we will compare the results of the two.

In [17]:
def final_lemma(text):
    pat = r'[$@#!&]'
    special_removed = re.sub(pat,'',text)
    punc_removed = ''.join([c for c in special_removed if c not in string.punctuation])
    num_removed= re.sub(r'\d+',"",punc_removed)
    lower = num_removed.lower()
    stop_removed = remove_stopwords(lower)
    return lemma(stop_removed)
    
    

In [18]:
def final_stem(text):
    pat = r'[$@#!&]'
    special_removed = re.sub(pat,'',text)
    punc_removed = ''.join([c for c in special_removed if c not in string.punctuation])
    num_removed= re.sub(r'\d+',"",punc_removed)
    lower = num_removed.lower()
    stop_removed = remove_stopwords(lower)
    return get_stem(stop_removed)
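
The two functions above differ only in their last line. Below is a sketch of one way to share the common steps by passing the final word-level step in as a callable. The names clean_text and strip_plural_s are hypothetical, and strip_plural_s is a toy used only to show the plumbing, not a real stemmer.

```python
import re
import string

def clean_text(text, normalize=None, stopwords=frozenset()):
    """Shared cleaning steps, with the final word-level step passed in."""
    # Remove punctuation (this also covers the special characters).
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove digits and lowercase.
    text = re.sub(r'\d+', '', text).lower()
    # Drop stopwords, then apply the optional per-word normalizer.
    words = [w for w in text.split() if w not in stopwords]
    if normalize is not None:
        words = [normalize(w) for w in words]
    return ' '.join(words)

def strip_plural_s(word):
    # Toy normalizer: drops a trailing "s" from words longer than 3 letters.
    return word[:-1] if word.endswith('s') and len(word) > 3 else word

print(clean_text("Anti-Romeo squads, 2017!", normalize=strip_plural_s))
```

In the notebook, normalize could be wordnet_lemmatizer.lemmatize or the stem method of a PorterStemmer instance, with stopwords set to the NLTK list.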

In [19]:
df['Headline_lemma'] = df['Headline'].apply(final_lemma)

In [20]:
df['Headline_lemma']

0                       stopped entering studio time arnab
1                      new trailer justice league released
2              touch right shilpa shinde sexual harassment
3        antiromeo squad must trouble consenting youth ...
4                  romeo juliet welcome delhi aap minister
                               ...                        
25191                     delhi govt school asked work day
25192                 compromise gujarat govt hardik patel
25193             want play putin screen leonardo dicaprio
25194                      sensex loses point hit week low
25195           ghulam ali set make acting debut bollywood
Name: Headline_lemma, Length: 25196, dtype: object

In [21]:
df['Headline_stem'] = df['Headline'].apply(final_stem)

In [22]:
df['Headline_stem']

0                          stop enter studio time arnab
1                       new trailer justic leagu releas
2                touch right shilpa shind sexual harass
3        antiromeo squad must troubl consent youth yogi
4                  romeo juliet welcom delhi aap minist
                              ...                      
25191                    delhi govt school ask work day
25192               compromis gujarat govt hardik patel
25193          want play putin screen leonardo dicaprio
25194                    sensex lose point hit week low
25195           ghulam ali set make act debut bollywood
Name: Headline_stem, Length: 25196, dtype: object



Comparing the two outputs shows that stemming has produced some non-words such as leagu, justic, releas, welcom and troubl. Hence we will drop that result and go forward with the lemmatized transformation.
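
The comparison can also be done programmatically instead of by eye. Here is a small sketch that uses the first three rows of the two outputs printed above and lists the tokens on which they disagree. It assumes both columns keep the same number of tokens per row, which holds for these rows.

```python
# Rows 0-2 copied from the Headline_lemma and Headline_stem outputs above.
lemma_rows = [
    "stopped entering studio time arnab",
    "new trailer justice league released",
    "touch right shilpa shinde sexual harassment",
]
stem_rows = [
    "stop enter studio time arnab",
    "new trailer justic leagu releas",
    "touch right shilpa shind sexual harass",
]

# Collect (lemmatized, stemmed) pairs wherever the two tokens differ.
changed = []
for lemma_row, stem_row in zip(lemma_rows, stem_rows):
    for lemma_tok, stem_tok in zip(lemma_row.split(), stem_row.split()):
        if lemma_tok != stem_tok:
            changed.append((lemma_tok, stem_tok))

for pair in changed:
    print(pair)
```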

In [23]:
df.columns

Index(['Headline', 'Short', 'Subject', 'Headline_lemma', 'Headline_stem'], dtype='object')

In [24]:
df.drop('Headline_stem', axis =1, inplace = True)

In [25]:
df.head()

Unnamed: 0,Headline,Short,Subject,Headline_lemma
0,Was stopped from entering my own studio at Tim...,TV news anchor Arnab Goswami has said he was t...,Politics,stopped entering studio time arnab
1,New trailer of &#39;Justice League&#39; released,A new trailer for the upcoming superhero film ...,Entertainment,new trailer justice league released
2,His touch was not right: Shilpa Shinde on sexu...,"Television actress Shilpa Shinde, while openin...",Entertainment,touch right shilpa shinde sexual harassment
3,Anti-Romeo squads must not trouble consenting ...,"Uttar Pradesh Chief Minister Yogi Adityanath, ...",Politics,antiromeo squad must trouble consenting youth ...
4,Both Romeo and Juliet are welcome in Delhi: AA...,In an apparent jibe at UP&#39;s anti-Romeo squ...,Politics,romeo juliet welcome delhi aap minister


Now we will apply the same function to the 'Short' column as well. After that we can remove the original columns, since the transformed columns will do the job for us.

In [26]:
df['Short_lemma'] = df['Short'].apply(final_lemma)

In [27]:
df['Short_lemma']

0        tv news anchor arnab goswami said told could p...
1        new trailer upcoming superhero film justice le...
2        television actress shilpa shinde opening claim...
3        uttar pradesh chief minister yogi adityanath v...
4        apparent jibe ups antiromeo squad delhi touris...
                               ...                        
25191    delhi government tuesday notified government s...
25192    patidar leader hardik patel tuesday rejected p...
25193    hollywood actor leonardo dicaprio recently exp...
25194    tracking weak cue asian market benchmark sense...
25195    pakistani ghazal singer ghulam ali soon make a...
Name: Short_lemma, Length: 25196, dtype: object

In [28]:
df.columns

Index(['Headline', 'Short', 'Subject', 'Headline_lemma', 'Short_lemma'], dtype='object')

In [29]:
df.drop(['Headline', 'Short'], axis =1, inplace = True )

In [30]:
df.head()

Unnamed: 0,Subject,Headline_lemma,Short_lemma
0,Politics,stopped entering studio time arnab,tv news anchor arnab goswami said told could p...
1,Entertainment,new trailer justice league released,new trailer upcoming superhero film justice le...
2,Entertainment,touch right shilpa shinde sexual harassment,television actress shilpa shinde opening claim...
3,Politics,antiromeo squad must trouble consenting youth ...,uttar pradesh chief minister yogi adityanath v...
4,Politics,romeo juliet welcome delhi aap minister,apparent jibe ups antiromeo squad delhi touris...


Now we will save this data frame as a CSV file that can be used in the next steps.

In [31]:
pwd

'C:\\Users\\admin\\Downloads\\Capstone Project - Inshorts'

In [32]:
df.to_csv(r'C:\Users\admin\Downloads\Capstone Project - Inshorts\inshorts data preprocessed.csv', index = False)
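
Since the working directory printed by pwd is the same folder we are saving into, the absolute path is not strictly needed. Below is a sketch of building the output path with pathlib, which avoids backslash escaping altogether; the variable name out_path is hypothetical.

```python
from pathlib import Path

# Build the output path relative to the current working directory
# instead of hard-coding an absolute Windows path.
out_path = Path.cwd() / "inshorts data preprocessed.csv"
print(out_path.name)

# In the notebook this could then be passed straight to pandas:
#   df.to_csv(out_path, index=False)
```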