<font color="green">**Jupyter notebook for preprocessing news data**</font>

In [2]:
import pandas as pd
import re
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS



**Read news from raw input files**

In [21]:
#company = "AMD"
#company = "Apple"
#company = "Disney"
company = "Tesla"
input_file = "D:\\cmpe295b\\news_data\\" + company + ".csv"
output_file = "processed" + company + ".csv"
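
The hard-coded Windows path above only works on the original machine. As a minimal sketch, the same two file names could be built with `pathlib`. The `data_dir` name and the relative `news_data` folder are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path

# Hypothetical data directory; the notebook hard-codes "D:\\cmpe295b\\news_data\\".
data_dir = Path("news_data")
company = "Tesla"

# Same naming scheme as the cell above: <company>.csv in, processed<company>.csv out.
input_file = data_dir / f"{company}.csv"
output_file = Path(f"processed{company}.csv")

print(input_file.name)
print(output_file)
```

`pd.read_csv` and `DataFrame.to_csv` both accept `Path` objects, so the rest of the notebook would not need to change.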

In [22]:
amddata = pd.read_csv(input_file)
amddata.head()

Unnamed: 0,company,news,link,date,body
0,Tesla Inc (TSLA.OQ),Tesla's Elon Musk calls for breakup of Amazon ...,https://www.reuters.com/article/idUSKBN23B307,2020-06-05 06:10:00,"(This June 4 story corrects to read Tesla, pa..."
1,Tesla Inc (TSLA.OQ),CORRECTED-UPDATE 1-Tesla's Elon Musk calls for...,https://www.reuters.com/article/idUSL1N2DH2HU,2020-06-05 06:06:00,"(This June 4 story corrects to read Tesla, pa..."
2,Tesla Inc (TSLA.OQ),CORRECTED-Tesla's Elon Musk calls for breakup ...,https://www.reuters.com/article/idUSFWN2DH0FX,2020-06-05 05:27:00,"(Corrects JUNE 4 story to read Tesla, paragrap..."
3,Tesla Inc (TSLA.OQ),"Breakingviews - Corona Capital: ZoomInfo IPO, ...",https://www.reuters.com/article/idUSKBN23B35T,2020-06-04 16:15:00,NEW YORK/LONDON/HONG KONG (Reuters Breakingvie...
4,Tesla Inc (TSLA.OQ),BRIEF-Tesla Daily Has Joined Maven's Coalition...,https://www.reuters.com/article/idUSFWN2DH0DI,2020-06-04 13:34:00,June 4 (Reuters) - Themaven Inc: \n* MAVEN - T...


**Keep only the date and headline news columns in a new dataframe**

In [23]:
filterData = amddata[['date','news']].copy()
filterData.head()

Unnamed: 0,date,news
0,2020-06-05 06:10:00,Tesla's Elon Musk calls for breakup of Amazon ...
1,2020-06-05 06:06:00,CORRECTED-UPDATE 1-Tesla's Elon Musk calls for...
2,2020-06-05 05:27:00,CORRECTED-Tesla's Elon Musk calls for breakup ...
3,2020-06-04 16:15:00,"Breakingviews - Corona Capital: ZoomInfo IPO, ..."
4,2020-06-04 13:34:00,BRIEF-Tesla Daily Has Joined Maven's Coalition...


**Remove upper-case prefixes that end with '-'**

**e.g. BRIEF - AMD stock rises amid positive financial news.
We must remove the 'BRIEF -' prefix because it adds no value.**

In [24]:
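def removePre(headline):
    # Find the first hyphen; the text before it is a candidate prefix.
    i = headline.find('-')
    # Drop the prefix only if it is entirely upper case (e.g. "BRIEF", "CORRECTED").
    if i != -1 and headline[:i].isupper():
        headline = headline[i+1:]
    # Only the first prefix is removed, so "CORRECTED-UPDATE 1-..." keeps "UPDATE 1-".
    headline = headline.strip()
    return headline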

filterData['news'] = filterData['news'].apply(removePre)

In [25]:
filterData.head(10)

Unnamed: 0,date,news
0,2020-06-05 06:10:00,Tesla's Elon Musk calls for breakup of Amazon ...
1,2020-06-05 06:06:00,UPDATE 1-Tesla's Elon Musk calls for breakup o...
2,2020-06-05 05:27:00,Tesla's Elon Musk calls for breakup of Amazon ...
3,2020-06-04 16:15:00,"Breakingviews - Corona Capital: ZoomInfo IPO, ..."
4,2020-06-04 13:34:00,Tesla Daily Has Joined Maven's Coalition Of In...
5,2020-06-04 13:12:00,Germany rebuffs gasoline auto lobby with radic...
6,2020-06-04 13:09:00,UPDATE 2-Germany rebuffs gasoline auto lobby w...
7,2020-06-04 13:06:00,America's billionaire wealth jumps by over hal...
8,2020-06-04 12:55:00,America's billionaire wealth jumps by over hal...
9,2020-06-04 12:03:00,Germany will require all petrol stations to pr...
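
Rows 1 and 6 above still start with `UPDATE 1-` and `UPDATE 2-`, because `removePre` strips only the first upper-case prefix. One possible variant, sketched here under the assumption that every leading upper-case prefix should go, repeats the check until no such prefix remains. The name `strip_prefixes` is hypothetical.

```python
def strip_prefixes(headline):
    """Repeatedly remove leading upper-case prefixes that end with '-'."""
    while True:
        i = headline.find('-')
        # Stop when there is no hyphen or the leading text is not all upper case.
        if i == -1 or not headline[:i].isupper():
            return headline.strip()
        headline = headline[i + 1:].strip()

print(strip_prefixes("CORRECTED-UPDATE 1-Tesla's Elon Musk calls for breakup of Amazon"))
print(strip_prefixes("Breakingviews - Corona Capital: ZoomInfo IPO"))
```

Like the original function, this sketch would also strip a legitimate upper-case lead such as `U.S.-`, so it may need an explicit list of known prefixes if that matters for the data.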


**Remove the time from the date column. We may need it later, though.**

In [26]:
def removeTime(datetime):
    arr = datetime.split()
    return arr[0]

filterData['date'] = filterData['date'].apply(removeTime)
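
Since the time may be needed later, one option is to split the timestamp rather than discard half of it. The sketch below assumes every value follows the `YYYY-MM-DD HH:MM:SS` layout seen in the output above. The function name `split_timestamp` is hypothetical.

```python
from datetime import datetime

def split_timestamp(timestamp):
    """Return (date, time) strings from a 'YYYY-MM-DD HH:MM:SS' timestamp."""
    # strptime raises ValueError if the value does not match the assumed layout.
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return parsed.date().isoformat(), parsed.time().isoformat()

date_part, time_part = split_timestamp("2020-06-05 06:10:00")
print(date_part)
print(time_part)
```

The two parts could then be stored in separate columns, so that later steps can still recover the publication time.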

**Lemmatize, remove stop words and strip special characters**

In [27]:
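lemmatizer = WordNetLemmatizer()  # needs the NLTK WordNet corpus: nltk.download('wordnet')
def removeChars(news):
    # Lemmatize each whitespace-separated token (treated as a noun by default).
    n = " ".join(lemmatizer.lemmatize(word) for word in news.split())
    # Drop stop words (case-sensitive match), then trim punctuation from token edges.
    n = " ".join([word.strip(",;:-") for word in n.split() if word not in ENGLISH_STOP_WORDS])
    # Remove remaining hyphens, which joins hyphenated words.
    n = n.replace('-','')
    # Turn apostrophes into spaces so "Tesla's" becomes "Tesla s".
    n = n.replace('\'',' ')
    # Discard single-character tokens such as the leftover "s".
    n = " ".join([word for word in n.split() if len(word) > 1])
    n = n.replace('"','')
    return n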
    #return re.sub('[^A-Za-z0-9]+', ' ', n)

filterData['news'] = filterData['news'].apply(removeChars)
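
The stop-word check in `removeChars` is case-sensitive, so capitalised words such as "Has" pass through, and removing hyphens joins hyphenated words together. The sketch below shows one possible alternative built on the commented-out regular expression, with a case-insensitive stop-word match. It skips lemmatization, and `clean_headline` and `SAMPLE_STOP_WORDS` are hypothetical names; the tiny word list only stands in for `ENGLISH_STOP_WORDS`.

```python
import re

# Tiny stand-in stop-word list, for illustration only; the notebook uses ENGLISH_STOP_WORDS.
SAMPLE_STOP_WORDS = frozenset({"a", "an", "the", "of", "for", "has", "in"})

def clean_headline(text, stop_words=SAMPLE_STOP_WORDS):
    # Replace every run of non-alphanumeric characters with one space.
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)
    # Keep tokens longer than one character that are not stop words (case-insensitive).
    kept = [w for w in text.split() if w.lower() not in stop_words and len(w) > 1]
    return " ".join(kept)

print(clean_headline("Tesla Daily Has Joined Maven's Coalition"))
```

Passing `ENGLISH_STOP_WORDS` as the `stop_words` argument would apply the full list from the import cell.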

**Remove rows having same news text**

In [28]:
filterData = filterData.drop_duplicates(subset=['news'])

**Write the result into a new file**

In [29]:
filterData.to_csv(output_file,index=False)