# Notebook for the feature extraction Python programming assignment (2.0 points)

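Directory **20news-18828** contains a list of subdirectories, one per newsgroup, and each newsgroup contains a list of message files. If you have uncompressed the data file in the current directory, the next cell should run without errors. The paths below are built with the Windows separator (a backslash), so on macOS or Linux they may need to be adapted.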

In [1]:
import os
print('The current directory is: ' + os.getcwd())
prefix = "20news-18828\\"        
directories = os.listdir(prefix)
files = os.listdir(prefix+directories[0])

print('\nThe subdirectories / newsgroups are:', directories)

print("\nThe first two files of alt.atheism are: ")
print(prefix + directories[0] + "\\" + files[0])
print(prefix + directories[0] + "\\" + files[1])

The current directory is: C:\Users\alber\Documents\UC3M\Advanced Programming\Assignment\Assignment2

The subdirectories / newsgroups are: ['alt.atheism', 'comp.graphics']

The first two files of alt.atheism are: 
20news-18828\alt.atheism\49960
20news-18828\alt.atheism\51060
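
The backslash separator used above only works on Windows. As a minimal sketch of a portable alternative, the paths can be built with *os.path.join*, which uses the separator of the current operating system. The helper name *newsgroup_path* is hypothetical and not part of the assignment.

```python
import os

def newsgroup_path(prefix, newsgroup, file_name):
    # Hypothetical helper: joins the parts with the separator of the current OS
    return os.path.join(prefix, newsgroup, file_name)

# Same file as in the output above, without hard-coding "\\"
example_path = newsgroup_path("20news-18828", "alt.atheism", "49960")
print(example_path)
```

On Windows this prints the same path as the cell above; on macOS or Linux the parts are joined with a forward slash.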


Next, you can see the *getMessage()* function. It opens a file that contains a message, reads the message, closes the file, and returns the message as a string.

In [2]:
def getMessage(fileName):
    # Open the file for reading; the with block closes it automatically,
    # even if reading raises an error
    with open(fileName, "r") as myFile:
        # Read the complete message
        message = myFile.read()
    return message
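
When *open()* is called without an *encoding* argument, the text is decoded with a platform-dependent default, so the same message file may be read differently (or fail to decode) on another machine. The sketch below fixes the encoding explicitly. It assumes the messages can be decoded as Latin-1, which is an assumption about the data and not something stated in the assignment; the name *getMessageWithEncoding* is hypothetical.

```python
import os
import tempfile

def getMessageWithEncoding(fileName, encoding="latin-1"):
    # Hypothetical variant of getMessage with an explicit encoding (assumed Latin-1)
    with open(fileName, "r", encoding=encoding) as myFile:
        return myFile.read()

# Self-contained check with a temporary file instead of the real data
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "message.txt")
    with open(tmp_path, "wb") as tmp_file:
        # 0xE9 is the Latin-1 byte for an accented "e"
        tmp_file.write(b"Subject: caf\xe9")
    decoded = getMessageWithEncoding(tmp_path)

# ascii() escapes non-ASCII characters so the result prints on any console
print(ascii(decoded))
```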

Here, we test that the function is working properly. 

In [3]:
my_file = prefix + directories[0] + "\\" + files[0]
print("I'm going to read the following file: ", my_file)
print('-----------------------------------------------------------')
message = getMessage(my_file)
print(message)

I'm going to read the following file:  20news-18828\alt.atheism\49960
-----------------------------------------------------------
From: mathew <mathew@mantis.co.uk>
Subject: Alt.Atheism FAQ: Atheist Resources

Archive-name: atheism/resources
Alt-atheism-archive-name: resources
Last-modified: 11 December 1992
Version: 1.0

                              Atheist Resources

                      Addresses of Atheist Organizations

                                     USA

FREEDOM FROM RELIGION FOUNDATION

Darwin fish bumper stickers and assorted other atheist paraphernalia are
available from the Freedom From Religion Foundation in the US.

Write to:  FFRF, P.O. Box 750, Madison, WI 53701.
Telephone: (608) 256-8900

EVOLUTION DESIGNS

Evolution Designs sell the "Darwin fish".  It's a fish symbol, like the ones
Christians stick on their cars, but with feet and the word "Darwin" written
inside.  The deluxe moulded 3D plastic fish is $4.95 postpaid in the US.

Write to:  Evolution Designs, 711

**Point 1.** Grade: 0.2 points.

Program a function that takes a message (a string) as input and returns a list with the words in the string. The function should do the following things:

- Convert the message to lowercase
- Convert '.', ',', '\n', '\t' to ' ', so that all separators become the blank space. You may consider replacing other separators, in addition to the former ones.
- Split the words contained in the message
- If there are empty words (''), they should be removed
- Return the list of words

In [4]:
# Import the regular expressions library to delete digits from the messages
import re

In [5]:
def returnWords(message):
    """
    Returns the list of words in the given message, after replacing some special characters
    (. , ; : = _ @ newline tab ? ( ) " < > # + % / *) with a blank space and deleting the digits 0-9
    """
    message = message.lower()
    to_change_chars = ['.', ',' , ';', ':', '=', '_', '@', '\n', '\t', '?', '(', ')', '"', "<", ">", '#',
                       '+', '%', '/', '*']
    # str.replace leaves the string unchanged when char is absent,
    # so no membership test is needed
    for char in to_change_chars:
        message = message.replace(char, ' ')
    # Digits are deleted, not replaced by a space
    message = re.sub('[0-9]', '', message)
    # Consecutive spaces produce empty strings when splitting on ' '; drop them
    words = [word for word in message.split(' ') if word]

    return words

In [6]:
# If your function works properly, the following should work

message = 'In a hole in the ground, there lived a Hobbit'
print(returnWords(message))

['in', 'a', 'hole', 'in', 'the', 'ground', 'there', 'lived', 'a', 'hobbit']
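
As a hedged alternative to the chain of *replace* calls, the same cleaning can be sketched with a single translation table built by *str.maketrans*: separators map to a blank space and digits are deleted. The names *SEPARATORS*, *TABLE* and *return_words_translate* are hypothetical. Note that *split()* without arguments splits on any whitespace, so this sketch also treats characters such as '\r' as separators.

```python
# Characters turned into a blank space (same set as in returnWords)
SEPARATORS = '.,;:=_@\n\t?()"<>#+%/*'
DIGITS = '0123456789'

# Two equal-length strings map character to character;
# the third argument lists characters to delete
TABLE = str.maketrans(SEPARATORS, ' ' * len(SEPARATORS), DIGITS)

def return_words_translate(message):
    # lower-case, apply the table, then split on runs of whitespace
    return message.lower().translate(TABLE).split()

hobbit_words = return_words_translate('In a hole in the ground, there lived a Hobbit')
print(hobbit_words)
```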


**Point 2.** Grade: 0.3 points.

Define a function that creates a dictionary for counting the words in a sentence. Use a *defaultdict*.

In [7]:
from collections import defaultdict

def createDictionary(words):
    """
    Returns a dictionary with the count of each word in words list
    """
    words_count = defaultdict(int)
    for word in words:
        words_count[word] += 1
    return words_count

In [8]:
# If your function works properly, the following should work

words = ['in', 'a', 'hole', 'in', 'the', 'ground', 'there', 'lived', 'a', 'hobbit']
dictio = createDictionary(words)
print(list(dictio.items()))

[('in', 2), ('a', 2), ('hole', 1), ('the', 1), ('ground', 1), ('there', 1), ('lived', 1), ('hobbit', 1)]
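
The assignment asks for a *defaultdict*, but the result can be cross-checked against *collections.Counter*, which counts the elements of a list directly. This is only a sanity check under that assumption, not a replacement for the requested solution.

```python
from collections import Counter, defaultdict

words = ['in', 'a', 'hole', 'in', 'the', 'ground', 'there', 'lived', 'a', 'hobbit']

# Manual count, as in createDictionary: missing keys start at int() == 0
manual_count = defaultdict(int)
for word in words:
    manual_count[word] += 1

# Counter performs the same counting in one call
counter_count = Counter(words)

print(dict(counter_count) == dict(manual_count))
```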


**Point 3.** Grade: 0.4 points.

Write a function that takes the previous dictionary and computes the ratio for each word.

In [9]:
def computeRatios(dictionary):
    """
    Returns a dictionary with the count ratio of each word in dictionary (count/total)
    """
    ratio_dict = {}
    total = sum(dictionary.values())
    for word in dictionary:
        ratio_dict[word] = dictionary[word] / total
    return(ratio_dict)

In [10]:
# If your function is correct, you should get what follows

words = ['in', 'a', 'hole', 'in', 'the', 'ground', 'there', 'lived', 'a', 'hobbit']
dictio = createDictionary(words)
dictio = computeRatios(dictio)
print(list(dictio.items()))

[('in', 0.2), ('a', 0.2), ('hole', 0.1), ('the', 0.1), ('ground', 0.1), ('there', 0.1), ('lived', 0.1), ('hobbit', 0.1)]
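
Because every ratio is a count divided by the total number of words, the ratios of a message should add up to 1, apart from floating-point rounding. The sketch below checks this on the example sentence; it uses a dictionary comprehension, which is one possible way to write the computation and not necessarily the expected one.

```python
import math

counts = {'in': 2, 'a': 2, 'hole': 1, 'the': 1, 'ground': 1,
          'there': 1, 'lived': 1, 'hobbit': 1}

# ratio = count of the word / total number of words in the message
total = sum(counts.values())
ratios = {word: count / total for word, count in counts.items()}

# Compare with a tolerance instead of ==, since the sum is a float
ratio_sum = sum(ratios.values())
print(math.isclose(ratio_sum, 1.0))
```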


**Point 4.** Grade: 0.5 points.

The code to write next is a bit harder. You have to loop through all subdirectories in "20news-18828\\". In fact, there are only two subdirectories: alt.atheism and comp.graphics. For each subdirectory, you have to loop through the **first 10 files**. We read only 10 files per subdirectory to save time; in a real situation we would read all of them. You will therefore need two nested loops.

For each file, you have to:

- read the message using the *getMessage* function
- obtain the words in the message using the function you defined in Point 1
- obtain the counts of the words of the message using the function in Point 2
- compute the ratio of words using the function in Point 3
- once you have the dictionary with the word ratios for a message, append it to the *messages* list, and append the name of the subdirectory to the *classes* list

In [11]:
# Prefix is a variable that contains the directory where each newsgroup is located
# Notice that a double \\ has to be used when \ is desired within a string
prefix = "20news-18828\\"        
directories = os.listdir(prefix)
print(directories)

# The messages list will contain one element per message. 
# Each element contains the word ratios for each message.
messages = []

# The classes list contains the class of each message
# In this example, there are only two classes: alt.atheism and comp.graphics
classes = []

# Please, write your code from here

for folder in directories:
    files = os.listdir(prefix + folder)
    for file_id in files[0:10]:
        file = prefix + folder + '\\'+ file_id
        message = getMessage(file)
        words = returnWords(message)
        words_count = createDictionary(words)
        words_ratio = computeRatios(words_count)
        messages.append(words_ratio)
        classes.append(folder)

['alt.atheism', 'comp.graphics']
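
*os.listdir* returns the entries of a directory in arbitrary order, so which files count as the "first 10" may differ between machines. A minimal sketch that makes the selection reproducible is to sort the names before slicing. The helper *first_files* is hypothetical, and the file names in the check are made up for illustration.

```python
import os
import tempfile

def first_files(directory, n=10):
    # Hypothetical helper: names are sorted as strings (so "10" sorts before "9"),
    # then the first n are kept
    return sorted(os.listdir(directory))[:n]

# Self-contained check with a temporary directory instead of the real data
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["20002", "10001", "30003"]:
        with open(os.path.join(tmp_dir, name), "w") as tmp_file:
            tmp_file.write("placeholder")
    selected = first_files(tmp_dir, n=2)

print(selected)
```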


In [12]:
# If the previous code is correct, you should get something like this:
import pprint
pp = pprint.PrettyPrinter(indent=4)
# First, we print the first 3 words of the first four messages
print('The first four messages are:')
pp.pprint([list(m.items())[:3] for m in messages[:4]])

# Then, we print the classes of the first four messages
print('\nThe first four classes are:')
print(classes[:4])

The first four messages are:
[   [   ('from', 0.00853658536585366),
        ('mathew', 0.001829268292682927),
        ('mantis', 0.001829268292682927)],
    [   ('from', 0.002306805074971165),
        ('mathew', 0.0009611687812379854),
        ('mantis', 0.0005767012687427913)],
    [   ('from', 0.007122507122507123),
        ('i', 0.011396011396011397),
        ('dbstu', 0.0014245014245014246)],
    [   ('from', 0.004132231404958678),
        ('mathew', 0.012396694214876033),
        ('mantis', 0.004132231404958678)]]

The first four classes are:
['alt.atheism', 'alt.atheism', 'alt.atheism', 'alt.atheism']


**Point 5.** Grade: 0.5 points.

Now, you have to compute the set of all unique words in all messages. That is, each word should appear just once. Please, store them in a variable called *unique_words*.

- In order to get the unique words of a message, you can use sets. That is, if you transform each message to a set, you will get the unique words in that message. 
- In order to compute the unique words of **all** messages, you can use the union of sets.

In [13]:
unique_words = set().union(*(mess_item.keys() for mess_item in messages))

In [14]:
# If your code works well, you should get something similar to (not necessarily exactly the same)
print(sorted(list(unique_words)[100:110]))

['dark-green-white-yellow', 'kolesov', "lamb's", 'mental', 'navy', 'people', 'perhaps', "person's", 'test', 'trimmed']
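
The two hints above can be illustrated on a toy example with two made-up messages (the dictionaries below are invented for illustration). Sets have no defined order, which is why a slice such as *list(unique_words)[100:110]* may show different words on each run.

```python
# Two invented messages, stored as word -> ratio dictionaries
toy_messages = [{'in': 0.5, 'a': 0.5}, {'a': 0.25, 'hole': 0.75}]

# set(dictionary) contains its keys, i.e. the unique words of that message
per_message = [set(message) for message in toy_messages]

# The union of all the sets keeps each word exactly once
toy_unique = set().union(*per_message)

# Sorting gives a stable order for printing
print(sorted(toy_unique))  # ['a', 'hole', 'in']
```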


**Point 6.** Grade: 0.5 points.

Finally, write everything into a text file called *output.txt*. For each message, there should be a line in the file. The last column in the file should be the class of the message. Here you can see the initial bits of the first two lines in the file (your actual lines could be different). 

0.3299009900990099 0 0 0 0.00039603960396039607 0 0 0 0 0 0 0.00039603960396039607 ... alt.atheism

0.1550550874918989 0 0.0003240440699935191 0 0 0.0003240440699935191 0.00016202203499675956 0.0008101101749837978 0.00016202203499675956 0 0.00016202203499675956 0 ... alt.atheism

In Aula Global you can see my own "output.txt" file so that you can compare it with yours. However, yours might be slightly different.

In [15]:
# The with block closes the file automatically, even if a write fails
with open("output.txt", "w") as my_file:
    # Write the header: one column per unique word, then the class column
    for word in unique_words:
        my_file.write('%s ' % word)
    my_file.write('class\n')

    # Write one row per message: the ratio of each header word
    # (0 if the word is not in the message), then the class of the message
    for message, message_class in zip(messages, classes):
        for word in unique_words:
            my_file.write('%s ' % message.get(word, 0))
        my_file.write('%s\n' % message_class)
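
Since a set has no defined order, the column order of *output.txt* can change from run to run, which is one reason two correct files may differ. A minimal sketch, assuming a fixed (sorted) column order is acceptable, builds each line as a string before writing it. The helper *format_row* and the toy data are hypothetical.

```python
def format_row(message, columns, label):
    # One value per column: the ratio of the word, or 0 if the message lacks it
    values = [str(message.get(word, 0)) for word in columns]
    # The class of the message goes in the last column
    return ' '.join(values + [label])

# Invented messages and classes, for illustration only
toy_messages = [{'in': 0.5, 'a': 0.5}, {'a': 0.25, 'hole': 0.75}]
toy_classes = ['alt.atheism', 'comp.graphics']

# Sorting the unique words fixes the column order
columns = sorted(set().union(*toy_messages))
rows = [format_row(m, columns, c) for m, c in zip(toy_messages, toy_classes)]

print(' '.join(columns + ['class']))
for row in rows:
    print(row)
```

Each string in *rows* could then be written to the file followed by a newline.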