# Machine Learning Basics
In this module, you will acquire and handle datasets, using the Cinema Data, Salary Data and Reviews Data for the tasks. <br> <br>
**Pipeline:**
* Acquiring the data
* Handling files and formats
* Data Analysis
* Prediction
* Analysing results

## Task 1 - Data Acquisition
* Retrieve the CinemaData dataset from Firebase, convert it to CSV and save it in the 'Data' folder as 'CinemaData.csv'. You may use shell scripts, packages or any other resources you need. The database can be accessed with an HTTP request; ask a TA for the link. <br> 
* Using `wget`, download 'SalaryData.txt' and save it in the 'Data' folder. Convert it to a CSV named 'SalaryData.csv' and save it in the same folder. It is available at this link: <br>
http://rebrand.ly/ml_salarydata

In [7]:
import requests
import pandas as pd


In [2]:
data = requests.get('https://sf-mlbasics.firebaseio.com/CinemaData.json')

In [6]:
df = pd.DataFrame(data.json())
df.to_csv('./Data/CinemaData.csv')
df.head()

Unnamed: 0,Capacity,DaysShowedInWeek,Index,LastDate,Lifetime,Movie,OccAtWeek,OccPer,OtherReleasesInWeek,ReleaseDate,ShowsInWeek,WeeksSinceRelease
0,3100,5,0,2015-12-17,2,1000001,2494,0.804516,8,2015-12-04,10,0
1,3390,7,1,2015-12-17,2,1000001,1932,0.569912,4,2015-12-04,14,1
2,860,2,2,2015-12-17,2,1000001,222,0.25814,4,2015-12-04,4,2
3,110,1,3,2016-01-06,5,1000002,59,0.536364,8,2015-12-04,1,0
4,720,6,4,2016-01-06,5,1000002,630,0.875,7,2015-12-04,6,3
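
The JSON-to-CSV step above relies on pandas. As a rough illustration of what that conversion involves, here is a minimal standard-library sketch. It assumes the payload is a JSON array of flat objects that all share the same keys, which the source does not confirm. The function name `json_records_to_csv` and the two-record sample are hypothetical.

```python
import csv
import io
import json


def json_records_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text.

    Assumption: every record has the same keys as the first one.
    """
    records = json.loads(json_text)
    # Use the keys of the first record as the header, in their original order.
    fieldnames = list(records[0].keys())
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical two-record payload, not the real CinemaData response.
sample = '[{"Movie": 1000001, "Capacity": 3100}, {"Movie": 1000001, "Capacity": 3390}]'
print(json_records_to_csv(sample))
```

Unlike `df.to_csv` with its default settings, this sketch writes no index column, so the resulting file has no unnamed first column.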


In [7]:
!wget -q 'http://rebrand.ly/ml_salarydata' -O './Data/SalaryData.txt'

In [8]:
df2 = pd.read_csv('Data/SalaryData.txt',sep=' ')
df2.to_csv('./Data/SalaryData.csv')
df2.head()

Unnamed: 0,YearsExperience,Salary
0,1.1,39343
1,1.3,46205
2,1.5,37731
3,2.0,43525
4,2.2,39891
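
The text-to-CSV conversion above can also be sketched without pandas. The example below assumes the file is whitespace-delimited with a header row, as the `sep=' '` argument and the output above suggest. The helper name `space_delimited_to_csv` is hypothetical, and the sample rows are copied from the output shown above.

```python
import csv
import io


def space_delimited_to_csv(text):
    """Rewrite whitespace-delimited text as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in text.splitlines():
        # split() with no argument collapses runs of whitespace,
        # which differs slightly from pandas' sep=' '.
        fields = line.split()
        if fields:  # skip blank lines
            writer.writerow(fields)
    return buffer.getvalue()


sample = "YearsExperience Salary\n1.1 39343\n1.3 46205\n"
print(space_delimited_to_csv(sample))
```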


## Task 2 - Dataset Handling
* You can find the Reviews Data in a RAR file in the 'Data' directory. Extract this dataset and use it for this module.

* The dataset contains positive and negative movie reviews. The files 'Positive_Reviews.txt' and 'Negative_Reviews.txt' list the names of the files containing positive and negative reviews, respectively. Create two directories, 'pos' and 'neg', and sort the reviews into them accordingly.

* Load 'cv000_29590.csv' and report the number of words present in the first column.

* Find the number of unique words in the first column. For this task, ignore punctuation: punctuation marks count neither as words nor as parts of words.

* Lookups: OS module, String functions

In [4]:
!mkdir 'Data/pos'


with open('Data/Positive_Reviews.txt') as pos_file:
    contents = pos_file.read()
    
    # Drop the enclosing '[' and ']' (strip() guards against a trailing newline)
    contents = contents.strip()[1:-1]
    
    # Remove quotes and spaces, then split on ',' to get the file names
    contents = contents.replace('\'','')
    contents = contents.replace(' ','')
    filenames = contents.split(',')
    for file in filenames:
        !cp "./Data/Reviews/$file" "./Data/pos/"
        

mkdir: cannot create directory ‘Data/pos’: File exists


In [5]:
!mkdir 'Data/neg'


with open('Data/Negative_Reviews.txt') as neg_file:
    contents = neg_file.read()
    
    # Drop the enclosing '[' and ']' (strip() guards against a trailing newline)
    contents = contents.strip()[1:-1]
    
    # Remove quotes and spaces, then split on ',' to get the file names
    contents = contents.replace('\'','')
    contents = contents.replace(' ','')
    filenames = contents.split(',')
    for file in filenames:
        !cp "./Data/Reviews/$file" "./Data/neg/"
        

mkdir: cannot create directory ‘Data/neg’: File exists
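
The two cells above use shell commands and manual string cleanup. Since the task points to the OS module, here is an alternative sketch in pure Python. It assumes each list file holds a Python-style list literal such as `['a.txt', 'b.txt']`, which is what the bracket and quote handling above suggests but is not confirmed by the source. It swaps the string cleanup for `ast.literal_eval`, and the function name `segregate` is hypothetical.

```python
import ast
import os
import shutil


def segregate(list_file, src_dir, dest_dir):
    """Copy every file named in list_file from src_dir into dest_dir."""
    with open(list_file) as handle:
        # Assumption: the file content is a list literal of file names.
        filenames = ast.literal_eval(handle.read().strip())
    # exist_ok=True avoids the "File exists" error on a rerun.
    os.makedirs(dest_dir, exist_ok=True)
    for name in filenames:
        shutil.copy(os.path.join(src_dir, name), dest_dir)
    return filenames
```

With the layout used in this notebook, a call might look like `segregate('Data/Positive_Reviews.txt', 'Data/Reviews', 'Data/pos')`, with a matching call for the negative reviews.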


In [85]:
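# read_csv treats the first line as the header, so the first column is named ' superman '
df3 = pd.read_csv('./Data/Reviews/cv000_29590.txt')
df3.fillna('',inplace=True)
words=list()
for line in df3[' superman ']:
    # Strip '.', '(', ')' and '?' before splitting on whitespace
    line = line.replace('.','').replace('(','').replace(')','').replace('?','')
    words.extend(line.split())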
print("The number of words: ",len(words))

The number of words:  191


In [86]:
df3 = pd.read_csv('./Data/Reviews/cv000_29590.txt')
df3.fillna('',inplace=True)
words=set()
for line in df3[' superman ']:
    line = line.replace('.','').replace('(','').replace(')','').replace('?','')
    words.update(line.split())
print("The number of Unique words: ",len(words))

The number of Unique words:  144
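
The cells above remove only four punctuation characters. A broader variant, sketched below, strips every ASCII punctuation character listed in `string.punctuation` before counting. This is an illustration rather than the method used above, so its counts for the review file would likely differ from the 191 and 144 reported. The function name `count_words` is hypothetical, and the comparison stays case-sensitive, as in the cells above.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def count_words(text):
    """Return (total words, unique words) after removing punctuation."""
    words = text.translate(_STRIP_PUNCTUATION).split()
    return len(words), len(set(words))


total, unique = count_words("Hello, world! Hello (again).")
print(total, unique)  # prints: 4 3
```

Because punctuation is deleted before splitting, a token made up only of punctuation disappears entirely, and a word such as "don't" is counted as "dont".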
