## Files

Files contain data that we can read or manipulate.

There are three steps to working with files:

1. Opening the file - 
    ``open ()``
    
2. Reading data from the file - 
    ``read ()``
    
3. Closing the file - 
    ``close ()``

*Note*: While working with files or filenames, ``.split ()`` is used frequently.

In [1]:
file_path = r'..datasets\wine.csv'  # raw string, so the backslash is kept as-is

print (file_path.split ('\\'))
print (file_path.split ('\\')[-1]) # .split () returns a list; [-1] picks its last item (a string)
print (file_path.split ('\\')[-1].split ('.'))
print (file_path.split ('\\')[-1].split ('.')[0])

['..datasets', 'wine.csv']
wine.csv
['wine', 'csv']
wine


In [2]:
f = open (r'..\datasets\wine.csv') # raw string keeps the backslashes literal

string = f.read ()
print (string[:400])
f.close ()

class_label,class_name,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280,proline
1,Barolo,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,Barolo,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
1,Barolo,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
1,Barolo,14.3


**Reading from a file after it has already been read returns an empty string (``''``), because the cursor is already at the EOF (end of file).**

To read again, the file needs to be closed and then opened again.
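
A quick way to check what a second ``read ()`` actually returns is to look at its ``repr``. The sketch below is only an illustration: it writes a small throwaway file (the name `demo.txt` and its content are made up) instead of using `wine.csv`.

```python
import os
import tempfile

# A temporary directory keeps the sketch from touching real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.txt')  # hypothetical file name

    f = open(demo_path, 'w')
    f.write('first line\nsecond line\n')
    f.close()

    f = open(demo_path)
    first_read = f.read()    # reads everything; the cursor is now at EOF
    second_read = f.read()   # nothing is left to read
    f.close()

    f = open(demo_path)      # reopening puts the cursor back at the start
    third_read = f.read()
    f.close()

print(repr(first_read))
print(repr(second_read))          # an empty string
print(second_read is None)        # False
print(third_read == first_read)   # True
```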

In [3]:
f = open (r'..\datasets\wine.csv')

print (f.read ()[:400])
print ('Trying to read the file again.')
print (f.read ())
f.close ()

class_label,class_name,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280,proline
1,Barolo,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,Barolo,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
1,Barolo,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
1,Barolo,14.3
Trying to read the file again.



To read the file again without closing it, ``.seek (0)`` moves the cursor back to the start of the file (this is rarely needed).

In [4]:
f = open (r'..\datasets\wine.csv')

print (f.read ()[:400])
print ('Trying to read the file again.')
f.seek (0)
print (f.read ()[:400])
f.close ()

class_label,class_name,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280,proline
1,Barolo,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,Barolo,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
1,Barolo,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
1,Barolo,14.3
Trying to read the file again.
class_label,class_name,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280,proline
1,Barolo,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,Barolo,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
1,Barolo,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
1,Barolo,14.3


By default, a file is opened in read mode. We can pass a second argument to the ``open ()`` function to specify how the file should be opened:

Ex: ``f = open (filepath, 'mode')``

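Mode: 
* r: read-only mode (the default). The file must already exist.

* w: write-only mode. Creates the file if it does not exist and overwrites any previous content.

* a: append mode. Adds the content at the end of the file.

* b: binary mode, for images and other non-text files. Combined with another mode, e.g. ``rb``.

* +: combined with any of the above modes to allow both reading and writing, e.g. ``r+``.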
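
The modes listed above can be tried out side by side. This is a minimal sketch on a throwaway file (the name `modes_demo.txt` is made up for illustration), not part of the original notebook.

```python
import os
import tempfile

# A temporary directory keeps the sketch from touching real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'modes_demo.txt')  # hypothetical file name

    f = open(path, 'w')      # 'w' creates the file (or empties an existing one)
    f.write('one\n')
    f.close()

    f = open(path, 'a')      # 'a' keeps the content and writes at the end
    f.write('two\n')
    f.close()

    f = open(path, 'r')      # 'r' is the default mode
    text = f.read()
    f.close()

    f = open(path, 'rb')     # adding 'b' returns bytes instead of str
    raw = f.read()
    f.close()

    f = open(path, 'r+')     # '+' allows reading and writing on the same handle
    f.write('ONE')           # overwrites the first three characters in place
    f.seek(0)
    updated = f.read()
    f.close()

print(repr(text))
print(type(raw))
print(repr(updated))
```

Note that `r+` does not truncate the file, so only the characters that are written over change.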

In [5]:
f = open('sampleFile.txt','w')
f.write('This is a sample file and it is just created.')
f.close()

f  = open('sampleFile.txt')
data = f.read()
print(data)
f.close()

This is a sample file and it is just created.


In [6]:
f = open('sampleFile.txt','w')
f.write('This is a sample file again and it is just created again.')
f.close()

f  = open('sampleFile.txt')
data = f.read()
print(data)
f.close()

This is a sample file again and it is just created again.


In [7]:
f = open('sampleFile.txt','a')
f.write('\n\nThis is a new line that we are writing.')
f.close()

f  = open('sampleFile.txt')
data = f.read()
print(data)
f.close()

This is a sample file again and it is just created again.

This is a new line that we are writing.


Some additional notes:

1. Read a file line by line


* ``readline ()`` : reads a single line; each further call reads the next line
* ``readlines ()`` : reads all the remaining lines and returns them as a list
* ``writelines ()`` : writes the strings from an iterable such as a list (no newlines are added)


2. ``with`` statement

Opens the file for the duration of the indented block and closes it automatically afterwards, even if an error occurs inside the block.
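
The two notes above can be combined in one small sketch. It uses a throwaway file (the name `lines_demo.txt` and its three lines are made up) and checks the file's `closed` attribute after the `with` block.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'lines_demo.txt')  # hypothetical file name

    # writelines() writes each string as-is; it does not add '\n' for us.
    with open(path, 'w') as f_out:
        f_out.writelines(['alpha\n', 'beta\n', 'gamma\n'])

    with open(path) as f_in:
        first = f_in.readline()     # one line, including its trailing '\n'
        rest = f_in.readlines()     # the remaining lines, as a list of strings

    # Leaving the with block has already closed the file.
    closed_after_with = f_in.closed

print(repr(first))
print(rest)
print(closed_after_with)
```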

In [8]:
with open ('sampleFile.txt') as f_in:
    rows = f_in.readlines ()
    for row in rows:
        print (row)
    
print (rows)
rows[1] = 'New line! Who wrote this anyway?\n'

with open ('sampleFile.txt', 'w') as f_out:
    f_out.writelines (rows)

with open ('sampleFile.txt') as f_in:
    rows = f_in.readlines ()
    for row in rows:
        print (row)

This is a sample file again and it is just created again.



This is a new line that we are writing.
['This is a sample file again and it is just created again.\n', '\n', 'This is a new line that we are writing.']
This is a sample file again and it is just created again.

New line! Who wrote this anyway?

This is a new line that we are writing.


## Modules

Modules in Python are files that contain classes, functions and variables which we can use in our own scripts.

Since modules live outside our program, we need to bring them in before using them (i.e. import the module).

For example:
```
    import math
```

In [9]:
# importing the whole module

import math

pie = math.pi 
print (pie)

3.141592653589793


In [10]:
# importing a specific object from a module
# from math import pi

# importing specific object from a module while giving it an alias
from math import pi as p

print (p)

3.141592653589793


## Reading CSV files

**csv** =  Comma Separated Values

Here each value is separated by a comma. If we read a csv file with Python's plain file read functions, each line comes back as a single string.

To separate the values we use the ``.split()`` method of the string object, which returns the values in a list that we can then index individually.
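
One detail of splitting by hand is that the last field of every line keeps its trailing newline unless we strip it first. The sketch below shows one way to handle that; it uses an in-memory `io.StringIO` with a shortened, three-column version of the wine data as a stand-in for the real file.

```python
import io

# A few rows in the same shape as wine.csv, shortened to three columns.
sample = io.StringIO(
    'class_label,class_name,alcohol\n'
    '1,Barolo,14.23\n'
    '1,Barolo,13.2\n'
)

# Strip the newline before splitting so the last field stays clean.
header = sample.readline().rstrip('\n').split(',')
rows = [line.rstrip('\n').split(',') for line in sample]

# Look the column up by name once, then convert its values to numbers.
alcohol_index = header.index('alcohol')
alcohol_values = [float(row[alcohol_index]) for row in rows]

print(header)
print(alcohol_values)
```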

In [11]:
# ./ means the file is in the same directory as the Jupyter Notebook
# ../ means the file is in the parent directory of the Jupyter Notebook


with open ('../datasets/wine.csv') as f:
    line = f.readline ()
    print (line)
    header_list = line.split (',')
    print (header_list)
    print (header_list[3])
    
    for line in f:
        data = line.split (',')
        print (data[3])
        break

class_label,class_name,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280,proline

['class_label', 'class_name', 'alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280', 'proline\n']
malic_acid
1.71


## csv module

The csv module does the boring work for us and gives us the data already separated into fields.

* **``reader ()``** - reads the data from file based on separator
* **``writer ()``** - writes the data in csv format

``next ()`` - returns the item at the current position of an iterator and then moves the cursor to the next item.

We use ``next ()`` to skip lines that we do not want, e.g. the header row or some rows at the start of the file.
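
To see how ``next ()`` consumes items one at a time, here is a small sketch on an in-memory stand-in for the file (the three-column sample is a shortened, illustrative version of the wine data). Passing a second argument to ``next ()`` gives a default value instead of an error once the iterator is exhausted.

```python
import csv
import io

sample = io.StringIO(
    'class_label,class_name,alcohol\n'
    '1,Barolo,14.23\n'
    '1,Barolo,13.2\n'
)

reader = csv.reader(sample)
header = next(reader)            # consumes the header row
first_data_row = next(reader)    # consumes the first data row
remaining = list(reader)         # whatever has not been consumed yet
after_end = next(reader, None)   # the iterator is exhausted, so the default is returned

print(header)
print(first_data_row)
print(remaining)
print(after_end)
```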

In [12]:
import csv

with open('../datasets/wine.csv') as f:
    reader = csv.reader (f, delimiter = ',')
    row = next (reader)
    for row in reader:
        print (row)
        break

['1', 'Barolo', '14.23', '1.71', '2.43', '15.6', '127', '2.8', '3.06', '0.28', '2.29', '5.64', '1.04', '3.92', '1065']


In [13]:
with open('../datasets/wine.csv') as f:
    reader = csv.reader (f) # by default the delimiter is ','
    for row in reader:
        print (row)
        break

['class_label', 'class_name', 'alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280', 'proline']


In [14]:
my_data = ['spam', 'ham', 'poha', 'parantha']

with open ('csvwrite.csv', 'w') as f_out:
    writer = csv.writer (f_out, delimiter = ',')
    writer.writerow (my_data)
    
with open ('csvwrite.csv') as f_in:
    reader = csv.reader (f_in)
    print (reader)
    for row in reader:
        print (row)

<_csv.reader object at 0x00000207882CF4C0>
['spam', 'ham', 'poha', 'parantha']
[]
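
The trailing `[]` in the output above is an empty row. A likely cause (an assumption, since the notebook does not say which platform it ran on) is that the file was opened on Windows without `newline=''`, so every row ended up followed by a blank line. The csv module documentation recommends opening csv files with `newline=''`. A minimal sketch, using a throwaway file name chosen for illustration:

```python
import csv
import os
import tempfile

my_data = ['spam', 'ham', 'poha', 'parantha']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'csvwrite_demo.csv')  # hypothetical file name

    # newline='' lets the csv module control the line endings itself.
    with open(path, 'w', newline='') as f_out:
        csv.writer(f_out).writerow(my_data)

    with open(path, newline='') as f_in:
        rows = list(csv.reader(f_in))

print(rows)   # a single row, with no empty list after it
```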


Plain csv reader and writer objects let us read the data properly, but each row is still a list, which becomes cumbersome to work with (we have to remember the index of each column to know which data column we are working with).

Fortunately, the csv module provides two classes for this task:

* **``DictReader ()``** - Read the data from the file row by row as a dictionary
* **``DictWriter ()``** - Write the data to a csv file from dictionaries.

Both classes work similarly to the ``reader ()`` and ``writer ()`` functions.
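
The practical gain is that columns are addressed by name rather than by position. The sketch below illustrates this on in-memory data (a shortened, three-column stand-in for the wine file); note that the values still arrive as strings and have to be converted explicitly.

```python
import csv
import io

sample = io.StringIO(
    'class_label,class_name,alcohol\n'
    '1,Barolo,14.23\n'
    '1,Barolo,13.2\n'
)

# Each row maps a column name to its (string) value.
alcohol_values = [float(row['alcohol']) for row in csv.DictReader(sample)]

# DictWriter needs the column order up front via fieldnames.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=['class_name', 'alcohol'])
writer.writeheader()
writer.writerow({'class_name': 'Barolo', 'alcohol': 14.23})
written_lines = out.getvalue().splitlines()

print(alcohol_values)
print(written_lines)
```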

In [15]:
from pprint import pprint
# pprint prints each key-value pair of the dictionary on its own line instead of all on a single line

with open ('../datasets/wine.csv') as f:
    reader = csv.DictReader (f)
    for row in reader:
        pprint (row)
        break

OrderedDict([('class_label', '1'),
             ('class_name', 'Barolo'),
             ('alcohol', '14.23'),
             ('malic_acid', '1.71'),
             ('ash', '2.43'),
             ('alcalinity_of_ash', '15.6'),
             ('magnesium', '127'),
             ('total_phenols', '2.8'),
             ('flavanoids', '3.06'),
             ('nonflavanoid_phenols', '0.28'),
             ('proanthocyanins', '2.29'),
             ('color_intensity', '5.64'),
             ('hue', '1.04'),
             ('od280', '3.92'),
             ('proline', '1065')])


In [16]:
dict_to_write = {'age': '20', 'height': '62', 'id': '1', 'weight': '120.6', 'name': 'Alice'}

with open ('DictWriterSample.csv', 'w') as f_out:
    writer = csv.DictWriter (f_out, delimiter="|", fieldnames =  dict_to_write.keys())
    writer.writeheader ()
    writer.writerow (dict_to_write)
    
with open ('DictWriterSample.csv') as f_in:
    reader = csv.DictReader (f_in, delimiter="|")
    for row in reader:
        print (row)

OrderedDict([('age', '20'), ('height', '62'), ('id', '1'), ('weight', '120.6'), ('name', 'Alice')])


## datetime module

Helps us work with dates and times. We use this module for many tasks, such as:

* Converting strings to datetime objects
* Converting dates to strings in a format of our choice - the US uses month/day/year, while we use day/month/year
* Doing arithmetic on dates, e.g. adding days or weeks
* Selecting only the data for one month or one week
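
As an illustration of the last point, once strings have been parsed into datetime objects, selecting one month is a matter of comparing the `year` and `month` attributes. The date strings below are made up for the example.

```python
from datetime import datetime

# Hypothetical date strings in day/month/year format.
date_strings = ['18/06/2018', '25/06/2018', '02/07/2018', '30/07/2018']

dates = [datetime.strptime(s, '%d/%m/%Y') for s in date_strings]

# Keep only the dates that fall in June 2018.
june_dates = [d for d in dates if d.year == 2018 and d.month == 6]
june_strings = [d.strftime('%d/%m/%Y') for d in june_dates]

print(june_strings)
```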

In [17]:
from datetime import datetime

The ``datetime`` class has a method ``strptime ()``, which we read as (str)ing (p)arse (time).

**``strptime ()``** - converts a string to a datetime object

It takes 2 arguments:

1. datestring: the string which contains the date, e.g. *25/05/2018*
2. format: the format in which the string is written, e.g. *%d/%m/%Y*
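
The format has to match the string, otherwise ``strptime ()`` raises a `ValueError`. A short sketch of both cases, using the example date from the list above:

```python
from datetime import datetime

datestring = '25/05/2018'

parsed = datetime.strptime(datestring, '%d/%m/%Y')
print(parsed.day, parsed.month, parsed.year)

# Read with a month-first format, the same string fails: 25 is not a valid month.
try:
    datetime.strptime(datestring, '%m/%d/%Y')
    mismatch_error = None
except ValueError as err:
    mismatch_error = err

print(type(mismatch_error).__name__)
```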

In [18]:
datestring = "22/06/2018"
date_object = datetime.strptime(datestring, "%d/%m/%Y")

print(date_object)
print(type(date_object))

my_date = "9/1/2015 6:09"

date_obj = datetime.strptime(my_date,'%m/%d/%Y %H:%M')
print(date_obj)
print(type(date_obj))

2018-06-22 00:00:00
<class 'datetime.datetime'>
2015-09-01 06:09:00
<class 'datetime.datetime'>


To do arithmetic such as adding days or weeks to our date object we use ``timedelta ()`` from the datetime module. (``timedelta ()`` has no month or year argument.)

In [19]:
from datetime import timedelta

date_obj = date_obj + timedelta(weeks=6)
print(date_obj)

2015-10-13 06:09:00


Common arguments of ``timedelta ()`` are:

* weeks
* days
* hours
* seconds
* minutes
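
These arguments can be combined in a single ``timedelta ()``. The sketch below uses an arbitrary start date chosen for illustration and also shows that an unsupported argument such as `months` is rejected.

```python
from datetime import datetime, timedelta

start = datetime(2018, 6, 18, 9, 0)   # arbitrary example date

# Several arguments can be combined in one timedelta.
step = timedelta(days=1, hours=12, minutes=30)
shifted = start + step
print(shifted)
print(step.total_seconds())

# timedelta has no 'months' argument; passing one raises TypeError.
try:
    timedelta(months=1)
    months_supported = True
except TypeError:
    months_supported = False

print(months_supported)
```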

In [20]:
indian_date = '18/6/2018'

training_start_date = datetime.strptime (indian_date, '%d/%m/%Y')
training_end_date = training_start_date + timedelta (weeks=6)

diff = training_end_date - training_start_date
print (diff.days)
print (diff.total_seconds ())

42
3628800.0


**``strftime ()``** - This is a method of the datetime object. It takes a single argument, *formatString*, i.e. the format in which we want the date to be written, and returns a string.

The format is made up of the same special codes (``%d``, ``%m``, ``%Y``, ...) that we used while converting a string to a datetime object.
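
A few format strings side by side, as a small sketch. The date is the one used elsewhere in these notes; the particular formats are just examples.

```python
from datetime import datetime

moment = datetime(2015, 9, 1, 6, 9)

iso_style = moment.strftime('%Y-%m-%d')   # year-month-day
day_first = moment.strftime('%d/%m/%Y')   # day/month/year
time_only = moment.strftime('%H:%M')      # 24-hour clock

# Parsing the formatted string with the same format gives back the date part.
round_trip = datetime.strptime(day_first, '%d/%m/%Y')

print(iso_style, day_first, time_only)
print(round_trip)
```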

In [21]:
my_date = '9/1/2015 6:09'

date_obj = datetime.strptime (my_date, '%m/%d/%Y %H:%M')
print (date_obj)

# convert back to a string in Indian (day/month/year) format and drop the hour and minute data
normal_date = date_obj.strftime ('%d/%m/%Y')
print (normal_date)        # changed to Indian format
print (type (normal_date))

2015-09-01 06:09:00
01/09/2015
<class 'str'>
