## **3**. Text Analysis <a class="anchor" id="textanalysis"/> 

During an online meeting, a client mentioned that your next project will involve analyzing a large collection of text data. As a Data Scientist, you want to start developing code that will be useful once you get your hands on the data. To do so, you start a small programming exercise that uses articles about cars from the [20 newsgroups dataset](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html). You start with the following code block.

In [1]:
from sklearn.datasets import fetch_20newsgroups # Import the function to download the dataset
import pandas as pd

data = fetch_20newsgroups()
data.target_names

categories = ['rec.autos'] # Define the car topic
dataset = fetch_20newsgroups(subset='all', shuffle=True, categories=categories) # Download articles
articles = dataset.data # Get the article text

Checking the first data instance:

In [391]:
print('---')
print('Text:', articles[0])
print('---')

---
Text: From: mad9a@fermi.clas.Virginia.EDU (Michael A. Davis)
Subject: Slick 50, any good?
Organization: University of Virginia
Lines: 9


     Chances are that this has been discussed to death already, and
if so could someone who has kept the discussion mail me or direct me 
to an archive site. Basically,
I am just wondering if Slick 50 really does all it says that it does.
And also, is there any data to support the claim.  Thanks for any info.

Mike Davis
mad9a@fermi.clas.virginia.edu

---


Since you want to be prepared to handle a **large amount of data**, you think that MapReduce could be useful for producing some insights about the data. **Follow the instructions below** to use the MapReduce paradigm **to compute a word count** of articles.

#### Solution sketch:

1. Preprocess the data: lowercase all words, remove stop words (e.g., the, an, a), and remove punctuation.
2. Define a mapper that produces a key-value pair `(key: word, value: 1)` for each word. This gives every word in an article an initial count of 1.
3. Define a reducer that takes in a mapping and sums the count of every word.
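
The three steps above can be sketched on a toy input using only the standard library. This is a minimal illustration of the map and reduce steps, not the notebook's solution: the `docs` list and the names `map_words` and `merge_counts` are hypothetical, and the input is assumed to be preprocessed already.

```python
from functools import reduce

# Toy stand-ins for preprocessed articles (hypothetical data, not from the dataset)
docs = ["slick car engine", "car engine oil", "car"]

def map_words(doc):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in doc.split()]

def merge_counts(totals, pairs):
    """Reduce step: fold one document's pairs into the running totals."""
    merged = dict(totals)
    for word, count in pairs:
        merged[word] = merged.get(word, 0) + count
    return merged

# Start from an empty dict and fold in the mapped documents one by one
toy_wordcount = reduce(merge_counts, map(map_words, docs), {})
print(toy_wordcount)  # {'slick': 1, 'car': 3, 'engine': 2, 'oil': 1}
```

The mapper never looks at more than one document and the reducer only combines partial results, which is what lets the same pattern be spread over many machines for large collections.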

**a.** Complete the function below to preprocess the articles by transforming all words to lower case.

In [392]:
def lower_case(article):
    lcase = article.lower()
    return lcase

In [393]:
articles = [lower_case(article) for article in articles]
articles[0]

'from: mad9a@fermi.clas.virginia.edu (michael a. davis)\nsubject: slick 50, any good?\norganization: university of virginia\nlines: 9\n\n\n     chances are that this has been discussed to death already, and\nif so could someone who has kept the discussion mail me or direct me \nto an archive site. basically,\ni am just wondering if slick 50 really does all it says that it does.\nand also, is there any data to support the claim.  thanks for any info.\n\nmike davis\nmad9a@fermi.clas.virginia.edu\n'

**b.** Apply the function `remove_stopwords` below to remove stop words.

In [394]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS # Import the list of stop words

def remove_stopwords(article):
    # Iterate over every word in an article and filter out stop words
    # We use the .join() method to concatenate the remaining words with a single space ` ` between them
    return ' '.join([word for word in article.split() if word not in ENGLISH_STOP_WORDS])

In [395]:
articles = [remove_stopwords(article) for article in articles]
articles[0]

'from: mad9a@fermi.clas.virginia.edu (michael a. davis) subject: slick 50, good? organization: university virginia lines: 9 chances discussed death already, kept discussion mail direct archive site. basically, just wondering slick 50 really does says does. also, data support claim. thanks info. mike davis mad9a@fermi.clas.virginia.edu'

**c.** Use the function below to remove punctuation.

In [396]:
import string # Import the string module to get the string.punctuation constant

def remove_punctuation(text):
    table = str.maketrans(dict.fromkeys(string.punctuation)) # Map punctuation characters
    no_punctuation = text.translate(table) # Remove punctuation from the string
    return no_punctuation

In [397]:
articles = [remove_punctuation(article) for article in articles]
articles[0]

'from mad9afermiclasvirginiaedu michael a davis subject slick 50 good organization university virginia lines 9 chances discussed death already kept discussion mail direct archive site basically just wondering slick 50 really does says does also data support claim thanks info mike davis mad9afermiclasvirginiaedu'
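
The output above still contains the stop word `a` (in `michael a davis`): when stop words were removed, the token was `a.`, which does not match the stop-word list. The sketch below shows how the order of the two steps changes the result. It uses a small hypothetical `STOP_WORDS` set as a stand-in for `ENGLISH_STOP_WORDS`, an invented sample sentence, and the three-argument form of `str.maketrans`, which deletes the listed characters just like the `dict.fromkeys` form.

```python
import string

# Small hypothetical stop-word set standing in for ENGLISH_STOP_WORDS
STOP_WORDS = {"a", "the", "it", "is", "i", "if"}

def remove_stopwords(text):
    """Keep only the whitespace-separated tokens that are not stop words."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

def remove_punctuation(text):
    """Delete every character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))

sample = "michael a. davis asks if it, the additive, is good."

# Order used in this exercise: tokens such as 'a.' and 'it,' slip through
stop_first = remove_punctuation(remove_stopwords(sample))
# Reversed order: punctuation is gone before the tokens are compared
punct_first = remove_stopwords(remove_punctuation(sample))

print(stop_first)   # michael a davis asks it additive good
print(punct_first)  # michael davis asks additive good
```

Removing punctuation first is not a complete fix: contractions such as `don't` become `dont`, which would still need separate handling.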

**d.** Complete the mapper function below.

In [398]:
def mapper(article):
    mapping = pd.DataFrame()
    mapping["words"] = article.split()
    mapping["count"] = 1
    return mapping

In [399]:
articlemap = map(mapper, articles)
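
In Python 3, `map()` returns a lazy iterator: it cannot be indexed and can be consumed only once. The sketch below illustrates this with a hypothetical `to_pairs` mapper and toy input; a list comprehension is one way to get a result that can be indexed and reused.

```python
def to_pairs(article):
    """Emit a (word, 1) pair for every word in one article."""
    return [(word, 1) for word in article.split()]

sample_articles = ["car engine", "car oil"]  # hypothetical toy input

lazy = map(to_pairs, sample_articles)  # nothing is computed yet
try:
    lazy[0]
    subscriptable = True
except TypeError:
    subscriptable = False  # map objects do not support indexing

first_pass = list(lazy)   # consumes the iterator
second_pass = list(lazy)  # empty: the iterator is already exhausted

materialized = [to_pairs(a) for a in sample_articles]  # a list can be indexed and reused
print(subscriptable, len(first_pass), len(second_pass), materialized[0])
```

Keeping the lazy `map` object is fine when the result is passed straight to a single `reduce()` call; materializing it is only needed when you want to inspect individual mappings first.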

**e.** Complete the reducer function below to compute the word count.

In [400]:
# Experimental code block: reduce two DataFrames into one
# Map the first two articles into a list so they can be indexed (a map object is not subscriptable)
workmap = [mapper(article) for article in articles[:2]]
workingMap = pd.concat([workmap[0], workmap[1]]) # Stack both mappings
workingMap.groupby('words')['count'].sum().reset_index() # Sum the counts per word


| | words | count |
|---|---|---|
| 0 | 1r0vk6innaftcronkitecentralsuncom | 1 |
| 1 | 23 | 1 |
| 2 | 317 | 1 |
| 3 | 4510815 | 1 |
| 4 | 46904 | 1 |
| ... | ... | ... |
| 116 | virginia | 1 |
| 117 | waiiiiiit | 1 |
| 118 | ways | 1 |
| 119 | wondering | 1 |


In [401]:
# The `: pd.DataFrame` annotations are type hints (added to get proper code completion)
def reducer(first_map: pd.DataFrame, second_map: pd.DataFrame):
    workingDataFrame = pd.concat([first_map, second_map]) # Stack both mappings
    count = workingDataFrame.groupby('words')['count'].sum().reset_index() # Sum the counts per word
    return count


In [402]:
from functools import reduce

wordcount = reduce(reducer, articlemap) # Use the reduce() function to compute the word count.
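
`reduce()` folds the sequence from left to right: the result of combining the first two items is combined with the third, and so on. The sketch below illustrates this with `collections.Counter` objects as lightweight stand-ins for the DataFrames; the `partials` data and the name `add_counts` are hypothetical.

```python
from collections import Counter
from functools import reduce

def add_counts(left, right):
    """Combine two partial word counts into one."""
    return left + right

# Hypothetical partial counts, one per article
partials = [Counter(car=2, engine=1), Counter(engine=1, oil=1), Counter(car=1)]

# reduce(f, [a, b, c]) evaluates f(f(a, b), c)
total = reduce(add_counts, partials)
stepwise = add_counts(add_counts(partials[0], partials[1]), partials[2])

# With an initial value, an empty input yields that value instead of raising TypeError
empty_total = reduce(add_counts, [], Counter())

print(total == stepwise, total.most_common(2), empty_total)
```

Because the accumulated result is passed back in at every step, a reducer that regroups its whole input each time does more work as the accumulator grows. That is acceptable for an exercise of this size but worth keeping in mind for larger collections.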


In [403]:
wordcount

| | words | count |
|---|---|---|
| 0 | 0 | 13 |
| 1 | 00 | 1 |
| 2 | 000701 | 1 |
| 3 | 002 | 7 |
| 4 | 004 | 2 |
| ... | ... | ... |
| 16218 | zugates | 1 |
| 16219 | zx | 1 |
| 16220 | zx11 | 5 |
| 16221 | zxr | 1 |


**f.** Sort the word count by its values.

In [406]:
ordered_wordcount = wordcount.sort_values(by='count', ascending=False)
ordered_wordcount

| | words | count |
|---|---|---|
| 3537 | car | 1423 |
| 13923 | subject | 1041 |
| 6584 | from | 1040 |
| 8889 | lines | 1015 |
| 10576 | organization | 986 |
| ... | ... | ... |
| 7140 | h3g | 1 |
| 11186 | policies | 1 |
| 2023 | altered | 1 |
| 7141 | h4s | 1 |


**g.** Run the code block below to show the 25 most frequent words.

In [413]:
ordered_wordcount.head(25)


| | words | count |
|---|---|---|
| 3537 | car | 1423 |
| 13923 | subject | 1041 |
| 6584 | from | 1040 |
| 8889 | lines | 1015 |
| 10576 | organization | 986 |
| 15967 | writes | 806 |
| 11812 | re | 804 |
| 2334 | article | 724 |
| 3586 | cars | 572 |
| 8867 | like | 532 |
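
Several of the most frequent words (`subject`, `from`, `lines`, `organization`) come from the message headers rather than from the article text. One way to reduce this noise is to drop the header block before preprocessing. The sketch below is a crude stand-in that assumes the header ends at the first blank line, so it has to run on the raw article, before `remove_stopwords` collapses the line breaks. The helper name `strip_header` is hypothetical. scikit-learn's `fetch_20newsgroups` also accepts a `remove` argument (for example `remove=('headers', 'footers', 'quotes')`) that may be a better fit.

```python
def strip_header(raw_article):
    """Return the text after the first blank line, assumed to separate header and body."""
    header, separator, body = raw_article.partition("\n\n")
    # If no blank line is found, leave the article unchanged
    return body.lstrip("\n") if separator else raw_article

# Shortened version of the first article shown earlier
sample = (
    "From: mad9a@fermi.clas.Virginia.EDU (Michael A. Davis)\n"
    "Subject: Slick 50, any good?\n"
    "Lines: 9\n"
    "\n"
    "\n"
    "I am just wondering if Slick 50 really does all it says that it does.\n"
)

body_only = strip_header(sample)
print(body_only)
```

Words such as `writes` and `article` are likely to come from quoted reply lines inside the body, which this sketch does not touch.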


In [418]:
# Use the code below to produce a simple character-based bar chart.
# The block character \u2588 is used to draw the bars.
for v in ordered_wordcount.head(25).itertuples():
    bar_size = int(v.count * 0.02) # Scale the counts (some exceed 1000) so the bars stay readable
    print(v.words, '\u2588' * bar_size, v.count)

car ████████████████████████████ 1423
subject ████████████████████ 1041
from ████████████████████ 1040
lines ████████████████████ 1015
organization ███████████████████ 986
writes ████████████████ 806
re ████████████████ 804
article ██████████████ 724
cars ███████████ 572
like ██████████ 532
just ██████████ 521
dont █████████ 483
it █████████ 462
nntppostinghost █████████ 459
university ████████ 441
good ███████ 395
know ███████ 389
new ███████ 376
engine ███████ 371
its ███████ 368
im ███████ 361
think ███████ 355
i ██████ 347
distribution ██████ 330
the ██████ 321
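
The fixed factor `0.02` works for this dataset but has to be retuned whenever the counts change in magnitude. A possible alternative, sketched below, scales every bar relative to the largest count. The function name `render_bars` and the `max_width` parameter are hypothetical, and the sample values are taken from the output above.

```python
def render_bars(rows, max_width=30):
    """Return one text line per (label, value) pair, scaling the largest value to max_width."""
    largest = max(value for _, value in rows)
    lines = []
    for label, value in rows:
        bar = "\u2588" * round(value / largest * max_width)
        lines.append(f"{label} {bar} {value}")
    return lines

# A few (word, count) pairs from the chart above
sample = [("car", 1423), ("subject", 1041), ("like", 532)]
chart_lines = render_bars(sample)
for line in chart_lines:
    print(line)
```

With a DataFrame such as `ordered_wordcount`, the rows could be built with `zip(df['words'], df['count'])`. The function assumes a non-empty input with positive counts.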
