# Data Acquisition: CSV processing

Hi everyone, <br />

This session deals with CSV processing, a common way to acquire data.

It will walk you through the following sections:

1. CSV document
2. CSV processing
3. Feature engineering
4. Import Titanic data in MongoDB
5. Import Hillary emails in MongoDB

## 1. CSV document

**What is a CSV file?**

CSV files are used to store tabular data in plain text. <br>
You can think of tabular data as data that can fit a spreadsheet. <br>
Each line of the file is a data record and each field in the line is separated by a comma. <br>
CSV stands for comma-separated values.
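
To make the format concrete, here is a minimal sketch that parses a few CSV lines held in a string. The sample text is made up in the style of the Titanic file, and it uses Python 3's standard `csv` module (an assumption, chosen so the snippet runs without the course files). Note that a field containing a comma must be wrapped in quotes.

```python
import csv
import io

# Made-up sample: a header line followed by two records.
# The name is quoted because it contains a comma itself.
sample = (
    'PassengerId,Name,Pclass\n'
    '1,"Braund, Mr. Owen Harris",3\n'
    '3,"Heikkinen, Miss. Laina",3\n'
)

# csv.reader splits each line into a list of fields
rows = list(csv.reader(io.StringIO(sample), delimiter=',', quotechar='"'))

for row in rows:
    print(row)
```

The first list holds the column names and each following list holds one record.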

**Titanic CSV dataset**

In this module, we introduce CSV file processing through a practical case study. We will use a CSV file with data about passengers of the Titanic. The file contains biographical and travel data about the passengers and indicates whether or not they survived the sinking. Later in the course, this dataset will be used to train a machine learning model that predicts the survival of a passenger.<br>

The CSV file is named "titanic_passengers_train.csv" and is located in the folder named "titanic_data". Each row describes a passenger with the following variables:
<br>
<br>- <b>PassengerId</b>: Unique ID of the passenger
<br>- <b>Survived</b>: Indicates if the passenger survived or not
<br>- <b>Pclass</b>:  Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
<br>- <b>Name</b>
<br>- <b>BirthDate</b>
<br>- <b>SibSp</b>: Number of Siblings/Spouses Aboard
<br>- <b>Parch</b>: Number of Parents/Children Aboard
<br>- <b>Fare</b>: Passenger Fare
<br>- <b>Cabin</b>: Cabin number
<br>- <b>Embarked</b>: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
<br><br>

**1.1 Can you open the "titanic_passengers_train.csv" file with Excel and visualize the data?**

## 2. CSV Processing

In [1]:
import unicodecsv
from datetime import datetime, timedelta
from pymongo import MongoClient
import pprint

# Connect to the local MongoDB server and select the "Solvay" database
db = MongoClient()["Solvay"]
# Pretty-printer used to display MongoDB documents
pp = pprint.PrettyPrinter(depth=6)

**Load CSV file in Python**

This step opens the CSV file and makes its content usable in Python. <br>

The unicodecsv library reads CSV data into Python. <br>

The DictReader object encodes each row as a dictionary where: <br>
- the key is the variable name <br>
- the value is the information related to the variable for the current row <br>
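
The row-to-dictionary mapping described above can be sketched on a small in-memory sample. This example uses the `DictReader` of Python 3's standard `csv` module instead of unicodecsv (an assumption, so that it runs without the course files), and the sample rows are abbreviated from the Titanic data.

```python
import csv
import io

# Abbreviated sample: the first line provides the dictionary keys
sample = (
    'PassengerId,Survived,Name\n'
    '1,0,"Braund, Mr. Owen Harris"\n'
    '3,1,"Heikkinen, Miss. Laina"\n'
)

reader = csv.DictReader(io.StringIO(sample), quotechar='"', delimiter=',')

# Each row becomes a dictionary: column name -> value for that row
records = [dict(row) for row in reader]

for record in records:
    print(record["Name"], "->", record["Survived"])
```

Every value is read as a string, including numeric columns such as "Survived".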

**2.1 Can you import the "titanic_passengers_train.csv" file in Python and print the dictionaries?**

In [72]:
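# Open the file in binary mode, as unicodecsv expects bytes
reader = unicodecsv.DictReader(open("./titanic_data/titanic_passengers_train.csv", "rb"), quotechar='"', delimiter=',')
# Print the first four rows only, then stop reading
for count, line in enumerate(reader):
    if count > 3:
        break
    print line
    print " "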

{u'Fare': u'7.25', u'Name': u'Braund, Mr. Owen Harris', u'Embarked': u'S', u'Parch': u'0', u'BirthDate': u'1890-01-09', u'Pclass': u'3', u'Survived': u'0', u'SibSp': u'1', u'PassengerId': u'1', u'Ticket': u'A/5 21171', u'Cabin': u''}
 
{u'Fare': u'71.2833', u'Name': u'Cumings, Mrs. John Bradley (Florence Briggs Thayer)', u'Embarked': u'C', u'Parch': u'0', u'BirthDate': u'1874-03-23', u'Pclass': u'1', u'Survived': u'1', u'SibSp': u'1', u'PassengerId': u'2', u'Ticket': u'PC 17599', u'Cabin': u'C85'}
 
{u'Fare': u'7.925', u'Name': u'Heikkinen, Miss. Laina', u'Embarked': u'S', u'Parch': u'0', u'BirthDate': u'1886-04-06', u'Pclass': u'3', u'Survived': u'1', u'SibSp': u'0', u'PassengerId': u'3', u'Ticket': u'STON/O2. 3101282', u'Cabin': u''}
 
{u'Fare': u'53.1', u'Name': u'Futrelle, Mrs. Jacques Heath (Lily May Peel)', u'Embarked': u'S', u'Parch': u'0', u'BirthDate': u'1876-09-16', u'Pclass': u'1', u'Survived': u'1', u'SibSp': u'1', u'PassengerId': u'4', u'Ticket': u'113803', u'Cabin': u'C12

## 3. Feature engineering

New features can often be derived from existing ones. <br>
It is worth creating such variables when they fit your needs. <br>
This process is called 'feature engineering'.

**Get gender from passenger name**

As you can see in the CSV file, each entry has a name, and the name contains a title ("Miss.", "Mrs." or "Mr."). We can use this information to engineer a feature for the passenger's gender.

**3.1 Can you create a function named 'get_gender_from_name' that takes a string as input (passenger's name) and outputs:** <br>
"f" if it's a woman <br>
"m" if it's a man <br>
"Null" if it's unknown <br>

In [89]:
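def get_gender_from_name(p_name=None):
    # No name given: the gender is unknown
    if p_name is None:
        return 'Null'
    name = p_name.lower().strip()
    # Test 'mrs' and 'miss' first, because 'mr' is a substring of 'mrs'
    if 'mrs' in name or 'miss' in name:
        return "f"
    elif 'mr' in name:
        return "m"
    # No known title found in the name
    return 'Null'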
    

**3.2 Can you test your function with the following names?**
"Mrs. Robinson",
"Mr. Tikkeling",
"Miss. Smith",
"Albert"

In [88]:
get_gender_from_name()

'Null'
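
One way to run exercise 3.2 is to loop over the four names. The function is repeated here so that the snippet is self-contained. This version returns "Null" for a name without a recognised title, which is an assumption about the intended behaviour based on the exercise statement.

```python
def get_gender_from_name(p_name=None):
    """Return "f", "m" or "Null" based on the title found in the name."""
    if p_name is None:
        return "Null"
    name = p_name.lower().strip()
    # 'mrs' and 'miss' are tested first because 'mr' is a substring of 'mrs'
    if "mrs" in name or "miss" in name:
        return "f"
    if "mr" in name:
        return "m"
    # Assumption: a name without a known title is reported as "Null"
    return "Null"

test_names = ["Mrs. Robinson", "Mr. Tikkeling", "Miss. Smith", "Albert"]
results = [get_gender_from_name(name) for name in test_names]

for name, gender in zip(test_names, results):
    print(name, "->", gender)
```

Because the check is a plain substring test, a name that merely contains the letters "mr" or "miss" would also match. Matching on the title as a separate word would be stricter.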

**Transform the "Survived" variable**

The "Survived" variable is currently encoded as a string, "1" or "0". As it will be the "answer" in the statistical model, it is useful to cast it to a boolean type (True/False).

**3.3 Can you create a function named "cast_survived_in_boolean" that transforms the variable "Survived" ("1" or "0") into a boolean (True or False)?**

Note: you must first cast the variable from string to integer, and then transform the integer into a boolean variable. Casting the string directly would not work, because any non-empty string, including "0", evaluates to True.

In [73]:
for line in reader:
    line["Survived"] = bool(int(line["Survived"]))

In [74]:
def cast_survived_in_boolean(arg):
    result = bool(int(arg))
    return result

In [79]:
'it' in "ring"

False

**3.4 Can you test your function with the following values?**
"1",
"0"

In [60]:
print(cast_survived_in_boolean("1"))

True
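
Both values from exercise 3.4 can be checked in one go. The sketch below repeats the function so it is self-contained, and also shows why the intermediate cast to an integer matters.

```python
def cast_survived_in_boolean(arg):
    # String -> integer -> boolean
    return bool(int(arg))

# Apply the function to both test values
results = {value: cast_survived_in_boolean(value) for value in ["1", "0"]}
print(results)

# Casting the string directly does not work:
# any non-empty string is truthy, including "0"
direct_cast = bool("0")
print(direct_cast)
```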


## 4. Import Titanic data in MongoDB

Once loaded in Python, the data can be transformed and imported into a MongoDB database.

**4.1 Can you import passengers data in Python,<br>**
**compute 'gender' and 'survival boolean' for each document,<br>**
**and import it in a MongoDB collection called 'titanic_passengers'?**

In [102]:
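reader = unicodecsv.DictReader(open("./titanic_data/titanic_passengers_train.csv", "rb"), quotechar='"', delimiter=',')
for line in reader:
    # Keep the original "Survived" string and store the computed values in new fields
    line['Survived_Boolean'] = cast_survived_in_boolean(line['Survived'])
    line['Gender'] = get_gender_from_name(line['Name'])
    # insert_one replaces the deprecated save method
    db['titanic_passengers'].insert_one(line)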

**4.2 Can you retrieve one of the passengers from the new collection?**

In [2]:
record = db['titanic_passengers'].find_one()
pp.pprint(record)

{u'BirthDate': u'1890-01-09',
 u'Cabin': u'',
 u'Embarked': u'S',
 u'Fare': u'7.25',
 u'Gender': u'm',
 u'Name': u'Braund, Mr. Owen Harris',
 u'Parch': u'0',
 u'PassengerId': u'1',
 u'Pclass': u'3',
 u'SibSp': u'1',
 u'Survived': u'0',
 u'Survived_Boolean': False,
 u'Ticket': u'A/5 21171',
 u'_id': ObjectId('5890aea901651c192c959004')}
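
The transformation step can also be tried without a running MongoDB server. The sketch below builds documents with the same `Gender` and `Survived_Boolean` fields as the record above. The helper `prepare_passenger` is a hypothetical name, the sample rows are abbreviated, and Python 3's standard `csv` module stands in for unicodecsv.

```python
import csv
import io

def get_gender_from_name(p_name=None):
    if p_name is None:
        return "Null"
    name = p_name.lower().strip()
    if "mrs" in name or "miss" in name:
        return "f"
    if "mr" in name:
        return "m"
    return "Null"

def cast_survived_in_boolean(arg):
    return bool(int(arg))

def prepare_passenger(row):
    """Hypothetical helper: copy the row and add the two computed fields."""
    document = dict(row)
    document["Gender"] = get_gender_from_name(row["Name"])
    document["Survived_Boolean"] = cast_survived_in_boolean(row["Survived"])
    return document

# Abbreviated sample rows in the style of the Titanic file
sample = (
    'PassengerId,Survived,Name\n'
    '1,0,"Braund, Mr. Owen Harris"\n'
    '3,1,"Heikkinen, Miss. Laina"\n'
)

documents = [prepare_passenger(row) for row in csv.DictReader(io.StringIO(sample))]

for document in documents:
    print(document)
```

Separating the transformation from the database call makes it easy to inspect the documents first. The resulting list could then be written in a single call, for example with pymongo's `insert_many`.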


## 5. Import Hillary emails in MongoDB

Register on the Kaggle website: "https://www.kaggle.com/" <br>
Download the data on Hillary Clinton's emails: https://www.kaggle.com/kaggle/hillary-clinton-emails <br>
Import the first 10,000 emails into your MongoDB Solvay database under the name "hillary_emails" <br>
Investigate how often and why Hillary refers to Vladimir Putin in her emails <br>

In [16]:
reader = unicodecsv.DictReader(open("Emails.csv", "rb"), quotechar='"', delimiter=',')
# Import at most the first 10,000 emails
for count, line in enumerate(reader):
    if count >= 10000:
        break
    db['hillary_emails'].insert_one(line)
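
Taking only the first N rows of a reader can also be written with `itertools.islice`, which avoids a manual counter. This is a minimal sketch on made-up rows, using the standard `csv` module and a limit of 3 instead of 10,000 so that the result is easy to check.

```python
import csv
import io
from itertools import islice

# Made-up sample with five rows standing in for Emails.csv
sample = "Id,MetadataSubject\n" + "".join(
    "%d,Subject %d\n" % (i, i) for i in range(1, 6)
)

reader = csv.DictReader(io.StringIO(sample))

limit = 3  # would be 10000 in the exercise
# islice stops reading once `limit` rows have been consumed
first_rows = [dict(row) for row in islice(reader, limit)]

print(len(first_rows))
print(first_rows[-1])
```

If the file has fewer rows than the limit, `islice` simply stops at the end of the file.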

In [14]:
reader = unicodecsv.DictReader(open("Emails.csv"), quotechar='"', delimiter=',')
count = 0
for line in reader:
    count += 1
print count

7945


In [17]:
db.drop_collection('hillary_emails')

In [8]:
record = db['hillary_emails'].find_one()
pp.pprint(record)

{u'DocNumber': u'C05739545',
 u'ExtractedBodyText': u'',
 u'ExtractedCaseNumber': u'F-2015-04841',
 u'ExtractedCc': u'',
 u'ExtractedDateReleased': u'05/13/2015',
 u'ExtractedDateSent': u'Wednesday, September 12, 2012 10:16 AM',
 u'ExtractedDocNumber': u'C05739545',
 u'ExtractedFrom': u'Sullivan, Jacob J <Sullivan11@state.gov>',
 u'ExtractedReleaseInPartOrFull': u'RELEASE IN FULL',
 u'ExtractedSubject': u'FW: Wow',
 u'ExtractedTo': u'',
 u'Id': u'1',
 u'MetadataCaseNumber': u'F-2015-04841',
 u'MetadataDateReleased': u'2015-05-22T04:00:00+00:00',
 u'MetadataDateSent': u'2012-09-12T04:00:00+00:00',
 u'MetadataDocumentClass': u'HRC_Email_296',
 u'MetadataFrom': u'Sullivan, Jacob J',
 u'MetadataPdfLink': u'DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739545/C05739545.pdf',
 u'MetadataSubject': u'WOW',
 u'MetadataTo': u'H',
 u'RawText': u'UNCLASSIFIED\nU.S. Department of State\nCase No. F-2015-04841\nDoc No. C05739545\nDate: 05/13/2015\nSTATE DEPT. - PRODUCED TO HOUSE SELECT BENGHAZI COMM.\nSUBJ

In [12]:
db['hillary_emails'].count_documents({})

7945
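
For the last task, a first step is to count the emails that mention Putin. The sketch below works on a made-up list of documents that reuse field names from the record shown above (`Id`, `MetadataSubject`, `ExtractedBodyText`). The helper `mentions` is a hypothetical name, and searching only those two fields is an assumption.

```python
import re

# Made-up documents; only the field names come from the dataset
emails = [
    {"Id": "1", "MetadataSubject": "WOW", "ExtractedBodyText": "Meeting notes"},
    {"Id": "2", "MetadataSubject": "Russia", "ExtractedBodyText": "Putin said he would attend."},
    {"Id": "3", "MetadataSubject": "Call with PUTIN", "ExtractedBodyText": ""},
]

# Case-insensitive search for the name
pattern = re.compile(r"putin", re.IGNORECASE)

def mentions(email, fields=("MetadataSubject", "ExtractedBodyText")):
    """Hypothetical helper: True if any of the given fields matches the pattern."""
    return any(pattern.search(email.get(field, "")) for field in fields)

matching_ids = [email["Id"] for email in emails if mentions(email)]

print(len(matching_ids), "of", len(emails), "emails mention Putin:", matching_ids)
```

Reading the matching emails afterwards is what answers the "why" part of the question. The count only tells how often the name appears.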