
We will learn:
* Interacting with the filesystem using the `os` module
* Downloading files from the internet using the `urllib` module
* Reading and processing data from text files
* Parsing data from CSV files into dictionaries & lists
* Writing formatted data back to text files

# Interacting with the OS and filesystem
The `os` module in Python provides many functions for interacting with the OS and the filesystem. Let's import it and try out some examples.

In [1]:
import os

We can check the present working directory using the `os.getcwd` function.

In [2]:
os.getcwd()

'/content'

To get the list of files in a directory, use `os.listdir`. You pass an absolute or relative path of a directory as the argument to the function.

In [3]:
os.listdir('/content')

['.config', 'sample_data']

In [4]:
help(os.listdir)

Help on built-in function listdir in module posix:

listdir(path=None)
    Return a list containing the names of the files in the directory.
    
    path can be specified as either str, bytes, or a path-like object.  If path is bytes,
      the filenames returned will also be bytes; in all other circumstances
      the filenames returned will be str.
    If path is None, uses the path='.'.
    On some platforms, path may also be specified as an open file descriptor;\
      the file descriptor must refer to a directory.
      If this functionality is unavailable, using it raises NotImplementedError.
    
    The list is in arbitrary order.  It does not include the special
    entries '.' and '..' even if they are present in the directory.



In [5]:
os.listdir('.') # relative path

['.config', 'sample_data']

In [6]:
os.listdir('/usr') #absolute path

['include',
 'bin',
 'src',
 'share',
 'sbin',
 'lib',
 'local',
 'games',
 'grte',
 'lib32']
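
Note that `os.listdir` returns bare names and does not say whether an entry is a file or a directory. Here is a minimal sketch, assuming a throwaway directory created with `tempfile` (the names `notes.txt` and `sub` are made up for illustration), that combines it with `os.path.join` and `os.path.isdir` to tell the two apart:

```python
import os
import tempfile

# Build a throwaway directory with one file and one sub-directory
# (both names are hypothetical, used only for this illustration).
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'sub'))
    with open(os.path.join(root, 'notes.txt'), 'w') as f:
        f.write('hello')

    # os.listdir returns names in arbitrary order, so sort them first.
    entries = sorted(os.listdir(root))
    kinds = {}
    for name in entries:
        full_path = os.path.join(root, name)
        kinds[name] = 'directory' if os.path.isdir(full_path) else 'file'

print(kinds)
```

Joining the directory path with each name matters here: `os.path.isdir('sub')` on its own would be resolved against the present working directory, not against `root`.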

You can create a new directory using `os.makedirs`. Let's create a new directory called `data`, where we'll later download some files.

In [7]:
os.makedirs('./data', exist_ok=True)

In [8]:
'data' in os.listdir('.')

True

In [9]:
os.listdir('./data')

[]
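
The `exist_ok=True` argument is what makes the `os.makedirs` call safe to run more than once. A small sketch, assuming a temporary directory so that nothing is left behind (the variable names are illustrative only):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, 'data')
    os.makedirs(target, exist_ok=True)   # creates the directory
    os.makedirs(target, exist_ok=True)   # already exists: no error

    try:
        os.makedirs(target)              # exist_ok defaults to False
        raised = False
    except FileExistsError:
        raised = True

print(raised)
```

Without `exist_ok=True`, re-running the notebook cell would fail with a `FileExistsError` once the directory has been created.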

Let us download some files into the `data` directory using the `urllib` module.

In [10]:
url1 = 'https://gist.githubusercontent.com/aakashns/257f6e6c8719c17d0e498ea287d1a386/raw/7def9ef4234ddf0bc82f855ad67dac8b971852ef/loans1.txt'
url2 = 'https://gist.githubusercontent.com/aakashns/257f6e6c8719c17d0e498ea287d1a386/raw/7def9ef4234ddf0bc82f855ad67dac8b971852ef/loans2.txt'
url3 = 'https://gist.githubusercontent.com/aakashns/257f6e6c8719c17d0e498ea287d1a386/raw/7def9ef4234ddf0bc82f855ad67dac8b971852ef/loans3.txt'

In [11]:
from urllib.request import urlretrieve

In [12]:
urlretrieve(url1, './data/loans1.txt')

('./data/loans1.txt', <http.client.HTTPMessage at 0x7f4046b12190>)

In [13]:
urlretrieve(url1, './data/loans2.txt')

('./data/loans2.txt', <http.client.HTTPMessage at 0x7f4046b12f50>)

In [14]:
urlretrieve(url1, './data/loans3.txt')

('./data/loans3.txt', <http.client.HTTPMessage at 0x7f4046b1df90>)

Let's verify that the files were downloaded.

In [15]:
os.listdir('./data')

['loans1.txt', 'loans3.txt', 'loans2.txt']

# Reading from a file
To read the contents of a file, we first need to open the file using the built-in `open` function. The `open` function returns a file object and provides several methods for interacting with the file's contents.

In [16]:
file1 = open('./data/loans1.txt', mode='r')

The `open` function also accepts a `mode` argument that specifies how we can interact with the file.
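
To make the role of `mode` concrete, here is a hedged sketch using three common modes (`'w'`, `'a'` and `'r'`) on a scratch file. The file name `example.txt` and the temporary directory are assumptions made purely for illustration:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, 'example.txt')  # hypothetical file name

    f = open(path, mode='w')   # 'w': create or truncate, then write
    f.write('first line\n')
    f.close()

    f = open(path, mode='a')   # 'a': append to the end
    f.write('second line\n')
    f.close()

    f = open(path, mode='r')   # 'r': read-only (the default)
    contents = f.read()
    f.close()

print(contents)
```

Opening an existing file with `'w'` discards its previous contents, so `'a'` is the safer choice when the goal is to add to a file.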

To view the contents of the file, we can use the `read` method of the file object.

In [17]:
file1_contents = file1.read()

In [18]:
print(file1_contents)

amount,duration,rate,down_payment
100000,36,0.08,20000
200000,12,0.1,
628400,120,0.12,100000
4637400,240,0.06,
42900,90,0.07,8900
916000,16,0.13,
45230,48,0.08,4300
991360,99,0.08,
423000,27,0.09,47200


The file contains information about loans. It is a set of comma-separated values (CSV).

>**CSVs:** A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields.

The first line of the file is the header, indicating what each of the numbers on the remaining lines represents. Each of the remaining lines provides information about a loan. Thus, the second line `100000,36,0.08,20000` represents a loan with:
* an amount of `$100000`,
* a duration of `36` months,
* a rate of interest of `8%` per annum, and
* a down payment of `$20000`

The CSV is a standard file format used for sharing data for analysis and visualization. Over the course of this tutorial, we will read the data from these CSV files, process it, and write the results back to files. Before we continue, let's close the file using the `close` method (otherwise, Python will keep the file open and continue to hold the system resources associated with it).

In [19]:
file1.close()

Once a file is closed, you can no longer read from it.

In [47]:
file1.read()

ValueError: I/O operation on closed file.

# Closing files automatically using `with`
To close a file automatically after you've processed it, you can open it using the `with` statement.

In [48]:
with open('./data/loans2.txt') as file2:
  file2_contents = file2.read()
  print(file2_contents)

amount,duration,rate,down_payment
100000,36,0.08,20000
200000,12,0.1,
628400,120,0.12,100000
4637400,240,0.06,
42900,90,0.07,8900
916000,16,0.13,
45230,48,0.08,4300
991360,99,0.08,
423000,27,0.09,47200


Once the statements within the `with` block are executed, the `.close` method on `file2` is automatically invoked. Let's verify this by trying to read from the file object again.

In [49]:
file2.read()

ValueError: I/O operation on closed file.
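
Another way to check this, without triggering an error, is the `closed` attribute of the file object. A minimal sketch, assuming a scratch file in a temporary directory (the file name is made up):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, 'example.txt')  # hypothetical file name
    with open(path, 'w') as f:
        f.write('amount,duration\n')
        inside = f.closed      # still open inside the block
    outside = f.closed         # closed once the block exits

print(inside, outside)
```

The file is closed even if an exception is raised inside the `with` block, which is the main reason to prefer it over calling `close` by hand.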

# Reading a file line by line
File objects provide a `readlines` method to read a file line-by-line.

In [23]:
with open('./data/loans3.txt', 'r') as file3:
  file3_lines = file3.readlines()

In [50]:
file3_lines

['amount,duration,rate,down_payment\n',
 '100000,36,0.08,20000\n',
 '200000,12,0.1,\n',
 '628400,120,0.12,100000\n',
 '4637400,240,0.06,\n',
 '42900,90,0.07,8900\n',
 '916000,16,0.13,\n',
 '45230,48,0.08,4300\n',
 '991360,99,0.08,\n',
 '423000,27,0.09,47200']
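
Notice that every line except possibly the last ends with `\n`. As an alternative to `readlines`, a file object can also be looped over directly, which yields one line at a time. The sketch below assumes a small scratch file (its name and contents are illustrative) and strips the newline from each line as it goes:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, 'sample.txt')  # hypothetical file name
    with open(path, 'w') as f:
        f.write('amount,duration\n100000,36\n200000,12')

    cleaned_lines = []
    with open(path, 'r') as f:
        for line in f:                 # yields one line at a time
            cleaned_lines.append(line.rstrip('\n'))

print(cleaned_lines)
```

Looping over the file avoids building the full list of lines up front, which can help with large files.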

# Processing data from files
Before performing any operations on the data stored in a file, we need to convert the file's contents from one large string into Python data types. For the file `loans1.txt` containing information about loans in a CSV format, we can do the following:
* Read the file line by line
* Parse the first line to get a list of the column names or headers
* Split each remaining line and convert each value into a float
* Create a dictionary for each loan using the headers as keys
* Create a list of dictionaries to keep track of all the loans

Since we will perform the same operations for multiple files, it would be useful to define a function `read_csv`. We'll also define some helper functions along the way, starting with `parse_headers`, which takes a line as input and returns a list of column headers.

In [24]:
def parse_headers(header_line):
  return header_line.strip().split(',')

The `strip` method removes leading and trailing whitespace, including the newline character `\n`. The `split` method breaks a string into a list using the given separator (`,` in this case).
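
Here is a short sketch of how the two methods behave on strings similar to the lines in our file. The padded header string is made up for illustration:

```python
raw = '  amount,duration,rate,down_payment \n'

stripped = raw.strip()          # drops leading/trailing whitespace only
parts = stripped.split(',')     # breaks the string at every comma

# A trailing comma produces an empty string as the last item.
with_gap = '200000,12,0.1,\n'.strip().split(',')

print(parts)
print(with_gap)
```

The empty string at the end of `with_gap` is how a missing value shows up after splitting, which becomes relevant when converting values to numbers.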

In [25]:
file3_lines[0]

'amount,duration,rate,down_payment\n'

In [26]:
headers = parse_headers(file3_lines[0])

In [27]:
headers

['amount', 'duration', 'rate', 'down_payment']

Next, let's define a function `parse_values` that takes a line containing some data and returns a list of floating point numbers.

In [28]:
line = file3_lines[0]

In [30]:
line = line.strip()
line.split(',')

['amount', 'duration', 'rate', 'down_payment']

In [31]:
def parse_values(data_line):
  values = []
  for item in data_line.strip().split(','):
    values.append(float(item))
  return values

In [32]:
file3_lines[1]

'100000,36,0.08,20000\n'

In [33]:
parse_values(file3_lines[1])

[100000.0, 36.0, 0.08, 20000.0]

The values were parsed and converted to floating point numbers, as expected. Let's try it for another line from the file, which does not contain a value for the down payment.

In [34]:
file3_lines[2]

'200000,12,0.1,\n'

In [35]:
parse_values(file3_lines[2])

ValueError: could not convert string to float: ''

The code above leads to a `ValueError` because the empty string `''` cannot be converted to a float. We can enhance the `parse_values` function to handle this edge case. We will also handle the case where the value is not a number by keeping it as a string.

In [36]:
def parse_values(data_line):
  values = []
  for item in data_line.strip().split(','):
    if item == '':
      values.append(0.0)
    else:
      try:
        values.append(float(item))
      except ValueError:
        values.append(item)
  return values

In [37]:
file3_lines[2]

'200000,12,0.1,\n'

In [38]:
parse_values(file3_lines[2])

[200000.0, 12.0, 0.1, 0.0]

Next, let's define a function `create_item_dict` that takes a list of values and a list of headers as inputs and returns a dictionary with the values associated with their respective headers as keys.

In [39]:
def create_item_dict(values, headers):
  result = {}
  for value, header in zip(values, headers):
    result[header] = value
  return result

In [40]:
for item in zip([1, 2, 3], ['a', 'b', 'c']):
  print(item)

(1, 'a')
(2, 'b')
(3, 'c')
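
Since `zip` produces pairs, the same dictionary can also be built in one step with `dict`. This is a sketch of an equivalent shortcut, not a replacement for the function above; the sample values are taken from the first loan in the file:

```python
headers = ['amount', 'duration', 'rate', 'down_payment']
values = [100000.0, 36.0, 0.08, 20000.0]

# dict() accepts an iterable of (key, value) pairs, which zip provides.
loan = dict(zip(headers, values))

# zip stops at the shorter input, so a short row yields fewer keys.
short_loan = dict(zip(headers, [200000.0]))

print(loan)
print(short_loan)
```

Because `zip` silently stops at the shorter list, a row with too few values produces a dictionary with missing keys rather than an error.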


Let's try out `create_item_dict` with a couple of examples.

In [41]:
file3_lines[1]

'100000,36,0.08,20000\n'

In [42]:
values1 = parse_values(file3_lines[1])
create_item_dict(values1, headers)

{'amount': 100000.0, 'duration': 36.0, 'rate': 0.08, 'down_payment': 20000.0}

In [43]:
file3_lines[2]

'200000,12,0.1,\n'

In [44]:
values2 = parse_values(file3_lines[2])
create_item_dict(values2, headers)

{'amount': 200000.0, 'duration': 12.0, 'rate': 0.1, 'down_payment': 0.0}

As expected, the values & header are combined to create a dictionary with the appropriate key-value pairs. We are now ready to put it all together and define the `read_csv` function.

In [45]:
def read_csv(path):
  result = []
  # Open the file in read mode
  with open(path, 'r') as f:
    # Get a list of lines
    lines = f.readlines()
    # Parse the header
    headers = parse_headers(lines[0])
    # Loop over the remaining lines
    for data_line in lines[1:]:
      # Parse the values
      values = parse_values(data_line)
      # Create a dictionary using values & headers
      item_dict = create_item_dict(values, headers)
      # Add the dictionary to the result
      result.append(item_dict)
    return result

Let's try it out!

In [52]:
with open('./data/loans2.txt') as file2:
    print(file2.read())

amount,duration,rate,down_payment
100000,36,0.08,20000
200000,12,0.1,
628400,120,0.12,100000
4637400,240,0.06,
42900,90,0.07,8900
916000,16,0.13,
45230,48,0.08,4300
991360,99,0.08,
423000,27,0.09,47200


In [51]:
read_csv('./data/loans2.txt')

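[{'amount': 100000.0, 'duration': 36.0, 'rate': 0.08, 'down_payment': 20000.0},
 {'amount': 200000.0, 'duration': 12.0, 'rate': 0.1, 'down_payment': 0.0},
 {'amount': 628400.0,
  'duration': 120.0,
  'rate': 0.12,
  'down_payment': 100000.0},
 {'amount': 4637400.0, 'duration': 240.0, 'rate': 0.06, 'down_payment': 0.0},
 {'amount': 42900.0, 'duration': 90.0, 'rate': 0.07, 'down_payment': 8900.0},
 {'amount': 916000.0, 'duration': 16.0, 'rate': 0.13, 'down_payment': 0.0},
 {'amount': 45230.0, 'duration': 48.0, 'rate': 0.08, 'down_payment': 4300.0},
 {'amount': 991360.0, 'duration': 99.0, 'rate': 0.08, 'down_payment': 0.0},
 {'amount': 423000.0, 'duration': 27.0, 'rate': 0.09, 'down_payment': 47200.0}]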

The file is read and converted to a list of dictionaries, as expected. The `read_csv` function is generic enough to parse any simple CSV file (one without quoted values), with any number of rows or columns. Here's the full code for `read_csv` along with the helper functions:

In [53]:
def parse_headers(header_line):
  return header_line.strip().split(',')

def parse_values(data_line):
  values = []
  for item in data_line.strip().split(','):
    if item == '':
      values.append(0.0)
    else:
      try:
        values.append(float(item))
      except ValueError:
        values.append(item)
  return values

def create_item_dict(values, headers):
  result = {}
  for value, header in zip(values, headers):
    result[header] = value
  return result


def read_csv(path):
  result = []
  # Open the file in read mode
  with open(path, 'r') as f:
    # Get a list of lines
    lines = f.readlines()
    # Parse the header
    headers = parse_headers(lines[0])
    # Loop over the remaining lines
    for data_line in lines[1:]:
      # Parse the values
      values = parse_values(data_line)
      # Create a dictionary using values & headers
      item_dict = create_item_dict(values, headers)
      # Add the dictionary to the result
      result.append(item_dict)
  return result

Try to create small, generic, and reusable functions whenever possible. They will likely be useful beyond just the problem at hand and save you significant effort in the future.

In [54]:
import math

def loan_emi(amount, duration, rate, down_payment=0):
  """Calculates the equal monthly installment (EMI) for a loan.

  Arguments:
    amount - Total amount to be spent (loan + down payment)
    duration - Duration of the loan (in months)
    rate - Rate of interest (monthly)
    down_payment (optional) - Optional initial payment (deducted from amount)
  """
  loan_amount = amount - down_payment
  try:
    emi = loan_amount * rate * ((1+rate)**duration) / (((1+rate)**duration)-1)
  except ZeroDivisionError:
    # A zero interest rate makes the denominator zero; split the amount evenly
    emi = loan_amount / duration

  emi = math.ceil(emi)
  return emi
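
To see both branches of `loan_emi` at work, here is a small worked sketch. The function is repeated so that the cell runs on its own; the interest-free loan of `120000` over `12` months is a made-up example, while the second call uses the first loan from the file:

```python
import math

def loan_emi(amount, duration, rate, down_payment=0):
    """Same logic as the function above, repeated so this cell runs alone."""
    loan_amount = amount - down_payment
    try:
        emi = loan_amount * rate * ((1 + rate) ** duration) / (((1 + rate) ** duration) - 1)
    except ZeroDivisionError:
        emi = loan_amount / duration
    return math.ceil(emi)

# Zero interest: the denominator is 0, so the fallback branch divides
# the principal evenly across the months.
interest_free = loan_emi(120000, 12, 0)

# First loan from the file: 8% per annum, converted to a monthly rate.
first_loan = loan_emi(100000, 36, 0.08 / 12, 20000)

print(interest_free, first_loan)
```

The `rate` argument is a monthly rate, so a yearly rate has to be divided by 12 before it is passed in.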

     

We can use this function to calculate EMIs for all loans in a file.

In [55]:
loans2 = read_csv('./data/loans2.txt')

In [57]:
loans2

[{'amount': 100000.0, 'duration': 36.0, 'rate': 0.08, 'down_payment': 20000.0},
 {'amount': 200000.0, 'duration': 12.0, 'rate': 0.1, 'down_payment': 0.0},
 {'amount': 628400.0,
  'duration': 120.0,
  'rate': 0.12,
  'down_payment': 100000.0},
 {'amount': 4637400.0, 'duration': 240.0, 'rate': 0.06, 'down_payment': 0.0},
 {'amount': 42900.0, 'duration': 90.0, 'rate': 0.07, 'down_payment': 8900.0},
 {'amount': 916000.0, 'duration': 16.0, 'rate': 0.13, 'down_payment': 0.0},
 {'amount': 45230.0, 'duration': 48.0, 'rate': 0.08, 'down_payment': 4300.0},
 {'amount': 991360.0, 'duration': 99.0, 'rate': 0.08, 'down_payment': 0.0},
 {'amount': 423000.0, 'duration': 27.0, 'rate': 0.09, 'down_payment': 47200.0}]

In [58]:
for loan in loans2:
  loan['emi'] = loan_emi(loan['amount'],
                         loan['duration'],
                         loan['rate']/12, # the CSV contains yearly rates
                         loan['down_payment'])

In [59]:
loans2

[{'amount': 100000.0,
  'duration': 36.0,
  'rate': 0.08,
  'down_payment': 20000.0,
  'emi': 2507},
 {'amount': 200000.0,
  'duration': 12.0,
  'rate': 0.1,
  'down_payment': 0.0,
  'emi': 17584},
 {'amount': 628400.0,
  'duration': 120.0,
  'rate': 0.12,
  'down_payment': 100000.0,
  'emi': 7582},
 {'amount': 4637400.0,
  'duration': 240.0,
  'rate': 0.06,
  'down_payment': 0.0,
  'emi': 33224},
 {'amount': 42900.0,
  'duration': 90.0,
  'rate': 0.07,
  'down_payment': 8900.0,
  'emi': 487},
 {'amount': 916000.0,
  'duration': 16.0,
  'rate': 0.13,
  'down_payment': 0.0,
  'emi': 62664},
 {'amount': 45230.0,
  'duration': 48.0,
  'rate': 0.08,
  'down_payment': 4300.0,
  'emi': 1000},
 {'amount': 991360.0,
  'duration': 99.0,
  'rate': 0.08,
  'down_payment': 0.0,
  'emi': 13712},
 {'amount': 423000.0,
  'duration': 27.0,
  'rate': 0.09,
  'down_payment': 47200.0,
  'emi': 15428}]

You can see that each loan now has a new key `emi`, which provides the EMI for the loan. We can extract this logic into a function so that we can use it for other files too.

In [60]:
def compute_emis(loans):
  for loan in loans:
    loan['emi'] = loan_emi(
        loan['amount'],
        loan['duration'],
        loan['rate']/12, # the CSV contains yearly rates
        loan['down_payment']
    )

# Writing to files
Now that we have performed some processing on the data, it would be good to write the results back to a CSV file. We can create/open a file in `w` mode using `open` and write to it using the `.write` method. The string `format` method will come in handy here.
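
As a quick refresher, here is a sketch of how `format` fills in placeholders, using a trimmed-down loan dictionary made up for illustration. An f-string is shown alongside as an equivalent alternative:

```python
loan = {'amount': 100000.0, 'duration': 36.0, 'emi': 2507}

# Each {} placeholder is filled by the matching positional argument.
line = '{},{},{}\n'.format(loan['amount'], loan['duration'], loan['emi'])

# An f-string (Python 3.6+) expresses the same thing inline.
same_line = f"{loan['amount']},{loan['duration']},{loan['emi']}\n"

print(line, end='')
```

The explicit `\n` is needed because `.write` does not add a newline on its own.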

In [61]:
loans2 = read_csv('./data/loans2.txt')

In [62]:
compute_emis(loans2)

In [63]:
loans2

[{'amount': 100000.0,
  'duration': 36.0,
  'rate': 0.08,
  'down_payment': 20000.0,
  'emi': 2507},
 {'amount': 200000.0,
  'duration': 12.0,
  'rate': 0.1,
  'down_payment': 0.0,
  'emi': 17584},
 {'amount': 628400.0,
  'duration': 120.0,
  'rate': 0.12,
  'down_payment': 100000.0,
  'emi': 7582},
 {'amount': 4637400.0,
  'duration': 240.0,
  'rate': 0.06,
  'down_payment': 0.0,
  'emi': 33224},
 {'amount': 42900.0,
  'duration': 90.0,
  'rate': 0.07,
  'down_payment': 8900.0,
  'emi': 487},
 {'amount': 916000.0,
  'duration': 16.0,
  'rate': 0.13,
  'down_payment': 0.0,
  'emi': 62664},
 {'amount': 45230.0,
  'duration': 48.0,
  'rate': 0.08,
  'down_payment': 4300.0,
  'emi': 1000},
 {'amount': 991360.0,
  'duration': 99.0,
  'rate': 0.08,
  'down_payment': 0.0,
  'emi': 13712},
 {'amount': 423000.0,
  'duration': 27.0,
  'rate': 0.09,
  'down_payment': 47200.0,
  'emi': 15428}]

In [64]:
with open('./data/emis2.txt', 'w') as f:
  for loan in loans2:
    f.write('{}, {}, {}, {}, {}\n'.format(
        loan['amount'],
        loan['duration'],
        loan['rate'],
        loan['down_payment'],
        loan['emi']
    ))

Let's verify that the file was created and written to as expected.

In [65]:
os.listdir('data')

['loans1.txt', 'loans3.txt', 'emis2.txt', 'loans2.txt']

In [66]:
with open('./data/emis2.txt', 'r') as f:
  print(f.read())

100000.0, 36.0, 0.08, 20000.0, 2507
200000.0, 12.0, 0.1, 0.0, 17584
628400.0, 120.0, 0.12, 100000.0, 7582
4637400.0, 240.0, 0.06, 0.0, 33224
42900.0, 90.0, 0.07, 8900.0, 487
916000.0, 16.0, 0.13, 0.0, 62664
45230.0, 48.0, 0.08, 4300.0, 1000
991360.0, 99.0, 0.08, 0.0, 13712
423000.0, 27.0, 0.09, 47200.0, 15428



Great, looks like the loan details (along with the computed EMIs) were written into the file.
Let's define a generic function `write_csv` which takes a list of dictionaries and writes it to a file in CSV format. We will also include the column headers in the first line.

In [67]:
def write_csv(items, path):
  # Open the file in write mode
  with open(path, 'w') as f:
    # Return if there's nothing to write
    if len(items) == 0:
      return
    #  Write the headers in the first line
    headers = list(items[0].keys())
    f.write(','.join(headers) + '\n')

    # Write one item per line
    for item in items:
      values = []
      for header in headers:
        values.append(str(item.get(header, "")))
      f.write(','.join(values) + '\n')

In [68]:
loans3 = read_csv('./data/loans3.txt')

In [69]:
compute_emis(loans3)

In [70]:
write_csv(loans3, './data/emis3.txt')

In [71]:
with open('./data/emis3.txt', 'r') as f:
  print(f.read())

amount,duration,rate,down_payment,emi
100000.0,36.0,0.08,20000.0,2507
200000.0,12.0,0.1,0.0,17584
628400.0,120.0,0.12,100000.0,7582
4637400.0,240.0,0.06,0.0,33224
42900.0,90.0,0.07,8900.0,487
916000.0,16.0,0.13,0.0,62664
45230.0,48.0,0.08,4300.0,1000
991360.0,99.0,0.08,0.0,13712
423000.0,27.0,0.09,47200.0,15428



With just four lines of code, we can now read each downloaded file, calculate the EMIs, and write the results back to new files.

In [73]:
for i in range(1,4):
  loans = read_csv('./data/loans{}.txt'.format(i))
  compute_emis(loans)
  write_csv(loans, './data/emis{}.txt'.format(i))

In [74]:
os.listdir('./data')

['loans1.txt',
 'emis1.txt',
 'loans3.txt',
 'emis3.txt',
 'emis2.txt',
 'loans2.txt']

Isn't that wonderful? Once all the functions are defined, we can calculate EMIs for thousands or even millions of loans across many files in seconds with just a few lines of code. Now we're starting to see the real power of using a programming language like Python for processing data!

# Using Pandas to Read and Write CSVs
There are some limitations to the `read_csv` and `write_csv` functions we've defined above:
* The `read_csv` function fails to create a proper dictionary if any of the values in the CSV file contain commas
* The `write_csv` function fails to create a proper CSV if any of the values to be written contain commas

When a value in a CSV file contains a comma (`,`), the value is generally placed within double quotes. Double quotes (`"`) in values are converted into two double quotes (`""`). Here's an example:
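
The quoting rule can be seen in isolation with Python's built-in `csv` module. Note that this module is not used elsewhere in this tutorial; it is brought in here only as a sketch of the rule, with a sample row chosen for illustration:

```python
import csv
import io

rows = [
    ['title', 'description'],
    ['The Dark Knight', 'Gotham, the "Batman", and the Joker'],
]

# Write to an in-memory buffer; csv.writer quotes fields only when needed.
buffer = io.StringIO()
csv.writer(buffer, lineterminator='\n').writerows(rows)
text = buffer.getvalue()

# Read it back; csv.reader undoes the quoting and the doubled quotes.
parsed = list(csv.reader(io.StringIO(text)))

print(text)
```

The description is wrapped in double quotes and the inner quotes are doubled when written, and reading the text back recovers the original values. Now let's look at a downloaded file that follows the same rule.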

In [75]:
movies_url = "https://gist.githubusercontent.com/aakashns/afee0a407d44bbc02321993548021af9/raw/6d7473f0ac4c54aca65fc4b06ed831b8a4840190/movies.csv"

In [76]:
urlretrieve(movies_url, 'data/movies.csv')

('data/movies.csv', <http.client.HTTPMessage at 0x7f403e35db50>)

In [77]:
movies = read_csv('data/movies.csv')

In [78]:
movies

[{'title': 'Fast & Furious', 'description': '"A movie'},
 {'title': 'The Dark Knight', 'description': '"Gotham'},
 {'title': 'Memento',
  'description': 'A guy forgets everything every 15 minutes'}]

As you can see above, the movie descriptions weren't parsed properly.
To read this CSV properly, we can use the `pandas` library.

In [79]:
import pandas as pd

The `pd.read_csv` function can be used to read the CSV file into a pandas data frame: a spreadsheet-like object for analyzing and processing data. We'll learn more about data frames in a future lesson.

In [84]:
movies_dataframe = pd.read_csv('data/movies.csv')

In [85]:
movies_dataframe


|   | title | description |
|---|---|---|
| 0 | Fast & Furious | A movie, a race, a franchise |
| 1 | The Dark Knight | Gotham, the "Batman", and the Joker |
| 2 | Memento | A guy forgets everything every 15 minutes |


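A data frame can be converted into a dictionary using its `to_dict` method. If you pass the argument `'records'`, you get a list of dictionaries, one per row. If you don't pass any argument, you get a dictionary of dictionaries instead, keyed first by column name and then by row index.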


In [86]:
movies_dict = movies_dataframe.to_dict()

In [90]:
movies_dict

{'title': {0: 'Fast & Furious', 1: 'The Dark Knight', 2: 'Memento'},
 'description': {0: 'A movie, a race, a franchise',
  1: 'Gotham, the "Batman", and the Joker',
  2: 'A guy forgets everything every 15 minutes'}}

Let's try using the `write_csv` function to write the data in movies back to a CSV file.

In [91]:
write_csv(movies, 'movies2.csv')

In [92]:
!head movies2.csv

title,description
Fast & Furious,"A movie
The Dark Knight,"Gotham
Memento,A guy forgets everything every 15 minutes


As you can see above, the CSV file is not formatted properly. This can be verified by attempting to read the file using `pd.read_csv`.

In [93]:
pd.read_csv('movies2.csv')

|   | title | description |
|---|---|---|
| 0 | Fast & Furious | A movie\nThe Dark Knight,Gotham |
| 1 | Memento | A guy forgets everything every 15 minutes |


To convert a list of dictionaries into a dataframe, you can use the `pd.DataFrame` constructor.

In [94]:
df2 = pd.DataFrame(movies)

In [95]:
df2

|   | title | description |
|---|---|---|
| 0 | Fast & Furious | "A movie |
| 1 | The Dark Knight | "Gotham |
| 2 | Memento | A guy forgets everything every 15 minutes |


It can now be written to a CSV file using the `.to_csv` method of a dataframe.

In [96]:
df2.to_csv('movies3.csv', index=None)

Can you guess what the argument `index=None` does? Try removing it and observing the difference in output.

In [97]:
!head movies3.csv

title,description
Fast & Furious,"""A movie"
The Dark Knight,"""Gotham"
Memento,A guy forgets everything every 15 minutes


The CSV file is formatted properly. We can verify this by trying to read it back.

In [98]:
pd.read_csv('movies3.csv')

|   | title | description |
|---|---|---|
| 0 | Fast & Furious | "A movie |
| 1 | The Dark Knight | "Gotham |
| 2 | Memento | A guy forgets everything every 15 minutes |
