# News-Link: Data Cleaning

To build the corpus of news articles for the **News-Link** app, I have been scraping articles daily from CBC.ca. I will now merge and clean all of these files. I will also isolate the links inserted by journalists and save them to a separate CSV file listing the additional stories that still need to be scraped.

In [3]:
#Required Python packages and functions
import pandas as pd
import numpy as np
import sys

#specifying local system path for remaining functions files
sys.path.insert(1, '/Users/Cylita/Desktop/insight-ds-project_news-link/scripts')

#self written functions and scripts that will be needed
import text_normalization_funs as TN
import similiarity_funcs as SIM

In [14]:
#Loading in required raw data for processing, dropping rows that are missing a title or main text
newsJan20 = pd.read_csv('/Users/Cylita/Desktop/insight-ds-project_news-link/data/raw/Jan_20_news.csv').dropna(subset=['title', 'maintext'])
newsJan21 = pd.read_csv('/Users/Cylita/Desktop/insight-ds-project_news-link/data/raw/Jan_21_news.csv').dropna(subset=['title', 'maintext'])
newsJan23 = pd.read_csv('/Users/Cylita/Desktop/insight-ds-project_news-link/data/raw/Jan_23_news.csv').dropna(subset=['title', 'maintext'])
newsJan24 = pd.read_csv('/Users/Cylita/Desktop/insight-ds-project_news-link/data/raw/Jan_24_news.csv').dropna(subset=['title', 'maintext'])
newsJan26 = pd.read_csv('/Users/Cylita/Desktop/insight-ds-project_news-link/data/raw/Jan_26_news.csv').dropna(subset=['title', 'maintext'])
newsJan27 = pd.read_csv('/Users/Cylita/Desktop/insight-ds-project_news-link/data/raw/Jan_27_news.csv').dropna(subset=['title', 'maintext'])

In [34]:
#Combining all the dataframes
combine_news = pd.concat([newsJan20, newsJan21, newsJan23, newsJan24, newsJan26, newsJan27], axis=0)
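The load-and-combine step above could also be written as a loop over every daily file in a folder, so that new scrape dates do not need a new line of code. This is only a sketch under assumptions: the helper name `load_daily_news`, the `*_news.csv` pattern, and the demo files are all made up for illustration, and it uses `ignore_index=True`, which the cell above does not.

```python
import glob
import os
import tempfile

import pandas as pd


def load_daily_news(raw_dir, pattern="*_news.csv"):
    """Hypothetical helper: read every matching CSV in raw_dir into one dataframe."""
    paths = sorted(glob.glob(os.path.join(raw_dir, pattern)))
    frames = [
        # Same filter as above: drop rows missing a title or main text
        pd.read_csv(path).dropna(subset=["title", "maintext"])
        for path in paths
    ]
    return pd.concat(frames, axis=0, ignore_index=True)


# Tiny demonstration with two made-up files in a temporary directory
demo_dir = tempfile.mkdtemp()
pd.DataFrame(
    {"title": ["A", None], "maintext": ["text a", "text b"], "mainurl": ["u1", "u2"]}
).to_csv(os.path.join(demo_dir, "Jan_20_news.csv"), index=False)
pd.DataFrame(
    {"title": ["C"], "maintext": ["text c"], "mainurl": ["u3"]}
).to_csv(os.path.join(demo_dir, "Jan_21_news.csv"), index=False)

demo_news = load_daily_news(demo_dir)
print(len(demo_news))  # the row with the missing title is dropped
```

Sorting the paths keeps the file order deterministic, which matters later because duplicate removal keeps the first occurrence of each URL.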

## Getting similar links inserted by journalists

In [38]:
#Likely some duplicate stories referenced over different days and different landing pages
#Removing duplications
unique_news = combine_news.drop_duplicates(subset = 'mainurl')
print(len(unique_news))

1221
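As a small illustration of what the de-duplication keeps: by default `drop_duplicates` retains the first row for each value of the subset column, so for a story scraped on several days the earliest copy in the combined dataframe survives. The data below is invented for the example.

```python
import pandas as pd

# Made-up example: the same URL scraped on two different days
demo = pd.DataFrame({
    "title": ["Story A (Jan 20)", "Story B", "Story A (Jan 21)"],
    "mainurl": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/a",
    ],
})

# keep='first' is the default, so the Jan 20 copy of Story A is retained
demo_unique = demo.drop_duplicates(subset="mainurl")
print(demo_unique["title"].tolist())
```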

In [77]:
#Isolating individual similar and related links by storing them in separate columns (right now all in the same column)
splitlinks = unique_news['simlinks'].str.split(' ', expand=True)
#Some articles have a lot of links. To take a manageable sample I will keep only four similar
#links inserted by a journalist. Note that .loc slices by label and is inclusive, so 1:4 keeps
#columns 1 to 4 and skips column 0 (assumed here to be empty, e.g. when 'simlinks' starts with a space)
splitlinks_reduced = splitlinks.loc[:,1:4]
#Renaming columns
splitlinks_foradding = splitlinks_reduced.rename(columns={1:"sim1", 2:'sim2', 3:'sim3', 4:'sim4'})
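The effect of splitting on a single space and then slicing columns by label depends on whether the link string starts with a space. The sketch below uses invented link strings (the real format of `simlinks` is not shown here) to compare the two cases, and shows one possible alternative that splits on any run of whitespace.

```python
import pandas as pd

# Made-up link strings: the first starts with a space, the second does not
links = pd.Series([" a b c d e", "a b c d e"])

# Splitting on a single space yields an empty string in column 0
# whenever the string starts with a space
split = links.str.split(" ", expand=True)
print(split[0].tolist())

# Label-based, inclusive slice: columns 1, 2, 3 and 4
reduced = split.loc[:, 1:4]
print(reduced.values.tolist())

# Alternative: split on whitespace runs (no leading empty string),
# then take the first four columns by position
split_ws = links.str.split(expand=True)
first_four = split_ws.iloc[:, :4]
print(first_four.values.tolist())
```

With the first approach, a row without a leading space loses its first link; the whitespace-based split treats both rows the same way.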

In [1]:
#Adding the first four similar links as four new columns to the existing dataframe
UniqLinkNews = pd.concat([unique_news, splitlinks_foradding], axis=1)

#Exporting as a CSV file
#This file will be used to scrape similar link articles using webscraping functions
#(see scraping similar articles script)
UniqLinkNews.to_csv('Cleaned_Split_RawNews.csv')

## Combining all news stories to create corpus

I have already used the above dataframe to scrape the similar stories inserted by journalists. I will now add these stories to the original news stories that I scraped. The resulting dataframe will be used to help validate the approach I have chosen (see validation scripts).

In [40]:
#Reading in dataframe with scraped web articles
sim_news = pd.read_csv("/Users/Cylita/Desktop/insight-ds-project_news-link/data/processed/News_Raw_ValData.csv").dropna(subset=['title', 'maintext'])
#Trimming to only relevant columns
trimmed_sim = sim_news[['author', 'date','title','maintext', 'mainurl']]

#Reading in the original dataframe where all similar links were split into columns (i.e. partial scraped news corpus)
UniqLinkNews = pd.read_csv('/Users/Cylita/Desktop/insight-ds-project_news-link/data/processed/Cleaned_Split_RawNews.csv')
trimmed_full = UniqLinkNews[['author','date', 'title', 'maintext', 'mainurl', 'simlinks', 'sim1']]

In [42]:
#Combining the above two dataframes
raw_val = pd.concat([trimmed_full, trimmed_sim], axis=0, ignore_index = True, sort=True)
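Because the two dataframes do not share all of their columns, it may help to see what `pd.concat` does in that situation. In this made-up example, the column missing from the second frame is filled with `NaN`, `sort=True` orders the columns alphabetically, and `ignore_index=True` renumbers the rows.

```python
import pandas as pd

# Made-up stand-ins: the second frame has no 'sim1' column
full = pd.DataFrame({"title": ["A"], "mainurl": ["u1"], "sim1": ["u9"]})
sim = pd.DataFrame({"title": ["B"], "mainurl": ["u9"]})

combined = pd.concat([full, sim], axis=0, ignore_index=True, sort=True)

print(combined.columns.tolist())         # columns sorted alphabetically
print(combined["sim1"].isna().tolist())  # missing values for rows from 'sim'
print(combined.index.tolist())           # fresh 0..n-1 index
```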

In [96]:
#For the above combined dataframe - creating combined text (story + title) and cleaning in preparation for NLP
#Combining the news story headlines with the maintext for the stories
full_text = raw_val["title"].map(str)+ '. ' + raw_val["maintext"]
#Preprocessing all text before any NLP analysis
normfull = TN.normalize_NewsText(full_text)

In [37]:
#Adding in cleaned text to dataframe
raw_val['full_text'] = full_text
raw_val['cleaned_text'] = normfull

In [149]:
#Exporting cleaned and processed text to a corpus document
raw_val.to_csv("Val_Corpus.csv")

In [80]:
#Reading back in the new dataframe to clean up, based on the external validation script run
val_corpus = pd.read_csv("Val_Corpus.csv")
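One thing to be aware of when writing a dataframe with `to_csv` and reading it back: by default the index is written as an unnamed first column, which comes back as `Unnamed: 0`. The sketch below shows this on invented data using an in-memory buffer, along with two possible ways to avoid the extra column. Whether it matters depends on what the downstream scripts expect.

```python
import io

import pandas as pd

demo = pd.DataFrame({"title": ["A", "B"], "mainurl": ["u1", "u2"]})

# Default: the index is written out and returns as an extra column
with_index = pd.read_csv(io.StringIO(demo.to_csv()))
print(with_index.columns.tolist())

# Option 1: do not write the index at all
without_index = pd.read_csv(io.StringIO(demo.to_csv(index=False)))
print(without_index.columns.tolist())

# Option 2: write the index, then restore it as the index when reading
restored = pd.read_csv(io.StringIO(demo.to_csv()), index_col=0)
print(restored.columns.tolist())
```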

In [101]:
#Removing all duplicates from the full news corpus that will be used in the project
unique_val = val_corpus.drop_duplicates(subset=['mainurl'])

In [106]:
#Isolating the first 30% of a news story, based on cleaned full text
#Using assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning
reduced_text = [TN.Clean30(doc) for doc in unique_val['cleaned_text']]
unique_val = unique_val.assign(reduced=reduced_text)

In [109]:
#Exporting cleaned and processed text to a corpus document
unique_val.to_csv("Final_Cleaned_Corpus.csv")
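The truncation step above relies on `TN.Clean30`, whose implementation is not shown in this notebook. As a rough idea of what "the first 30% of a story" could mean, here is a hypothetical stand-in that keeps the first 30% of whitespace-separated tokens. The function name `first_fraction` and the token-based definition are assumptions for illustration, not the actual helper.

```python
import math


def first_fraction(text, fraction=0.3):
    """Hypothetical stand-in for TN.Clean30: keep the leading share of tokens."""
    tokens = text.split()
    # Round up so that short, non-empty texts keep at least one token
    n_keep = math.ceil(len(tokens) * fraction)
    return " ".join(tokens[:n_keep])


demo = "one two three four five six seven eight nine ten"
print(first_fraction(demo))
```

The real helper may instead count characters or sentences; the idea is the same as long as the cut is taken from the start of the cleaned text.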