## Ivan Lee Cancino  -  A01793491

# MODULE 4: Working with Data in Python

## Reading Files with Open

One way to read or write a file in Python is to use the built-in open function. The open function provides a File object that contains the methods and attributes you need in order to read, save, and manipulate the file.
- The mode argument is optional and the default value is r
    - **r**: Read mode for reading files
    - **w**: Write mode for writing files

In [3]:
# Open the file Example1.txt
example1 = "Example1.txt"
file1 = open(example1, "r")

In [4]:
# Print the path of file
print(file1.name)
# Print the mode of file, either 'r' or 'w'
print(file1.mode)

Example1.txt
r


In [5]:
# We can read the file and assign it to a variable
FileContent = file1.read()
FileContent

'This is Line 1\nThis is Line 2\nThis is Line 3'

In [6]:
# Type of file content:
type(FileContent)

str

In [7]:
# We can print the file contents:
print(FileContent)

This is Line 1
This is Line 2
This is Line 3


It is very important to close the file once you are done with it. Closing frees up resources and ensures consistent behavior across different Python versions.

In [8]:
# Close file after finishing:
file1.close()

Another (better) way to read a file:
- Using the **with** statement is better practice because it automatically closes the file, even if the code raises an exception. Python runs everything in the indented block and then closes the file object.

In [10]:
# open the file using with, and print its content
with open(example1, "r") as file1:
    FileContent = file1.read()
    print(FileContent)

This is Line 1
This is Line 2
This is Line 3


In [12]:
# Verify that the file is closed: 
file1.closed

True
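
The claim that **with** closes the file even when an exception occurs can be checked with a small sketch. The scratch file name and the simulated error below are assumptions made for illustration; they are not part of the original notebook.

```python
import os
import tempfile

# Hypothetical scratch file so the sketch does not depend on Example1.txt
path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")
with open(path, "w") as f:
    f.write("This is Line 1\n")

try:
    with open(path, "r") as demo_file:
        first_line = demo_file.readline()
        # Simulate a failure in the middle of the indented block
        raise ValueError("simulated failure inside the with block")
except ValueError as err:
    print("Caught:", err)

# The with statement closed the file on the way out of the block
closed_after_error = demo_file.closed
print(closed_after_error)
```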

In [13]:
# We can read the first characters of the file
# for example by passing 4 as a parameter to the .read() method: 
with open(example1, "r") as file1:
    print(file1.read(4))

This


In [15]:
# If we call the method again, it continues reading from where the previous call stopped:
with open(example1, "r") as file1:
    print(file1.read(4))
    print(file1.read(3))
    print(file1.read(6))
    print(file1.read(16))

This
 is
 Line 
1
This is Line 2


In [17]:
# We can also read one line of the file at a time using the method readline()
with open(example1, "r") as file1:
    print("first line: " + file1.readline())

first line: This is Line 1



In [18]:
# We can also pass an argument to readline() to specify the number of characters we want to read;
# however, readline() reads at most one line.
with open(example1, "r") as file1:
    print(file1.readline(20)) # does not read past the end of line
    print(file1.read(20)) # Returns the next 20 chars

This is Line 1

This is Line 2
This 


We can use a loop to iterate through each line:

In [20]:
# Iterate through the lines:
with open(example1,"r") as file1:
    i = 0
    for line in file1:
        print("Iteration " + str(i) + ":", line)
        i = i + 1

Iteration 0: This is Line 1

Iteration 1: This is Line 2

Iteration 2: This is Line 3
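
The manual counter in the loop above can also be written with the built-in enumerate(), which yields the index together with each line. This is only a sketch of an alternative style; the scratch file and the names demo_file and seen are assumptions for illustration.

```python
import os
import tempfile

# Hypothetical scratch file with the same three lines as Example1.txt
path = os.path.join(tempfile.mkdtemp(), "loop_demo.txt")
with open(path, "w") as f:
    f.write("This is Line 1\nThis is Line 2\nThis is Line 3")

seen = []
with open(path, "r") as demo_file:
    # enumerate() supplies the counter, so no manual i = i + 1 is needed
    for i, line in enumerate(demo_file):
        seen.append((i, line))
        # end="" avoids adding a second newline after the one already in line
        print("Iteration " + str(i) + ": " + line, end="")
print()
```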


We can use the method readlines() to save the text file to a list:

In [23]:
# Read all lines and save as a list
# Each element of the list corresponds to a line of text:

with open(example1, "r") as file1:
    FileasList = file1.readlines()

print(FileasList[0])
print(FileasList[1])
print(FileasList[2])

This is Line 1

This is Line 2

This is Line 3
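
The blank lines in the output above appear because each list element returned by readlines() still ends with its newline character, and print() adds another one. The sketch below makes this visible and shows one possible way to get the lines without the trailing newlines; the scratch file and variable names are assumptions for illustration.

```python
import os
import tempfile

# Hypothetical scratch file with the same three lines as Example1.txt
path = os.path.join(tempfile.mkdtemp(), "readlines_demo.txt")
with open(path, "w") as f:
    f.write("This is Line 1\nThis is Line 2\nThis is Line 3")

with open(path, "r") as demo_file:
    raw_lines = demo_file.readlines()  # newline characters are kept

with open(path, "r") as demo_file:
    clean_lines = demo_file.read().splitlines()  # newline characters are dropped

print(raw_lines)
print(clean_lines)
```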


## Writing Files with Open

We can use the write() method of a file object to save text to a file.
To write to a file, the mode argument must be set to **w**.

In [24]:
# Write line to a new file:
exmp2 = "Example2.txt"
with open(exmp2, "w") as writefile:
    writefile.write("This is Line A")

In [25]:
# We can check if the file was created
# and check if the line was written:

with open(exmp2, "r") as testwritefile:
    print(testwritefile.read())

This is Line A


In [28]:
# Write more lines to the file:

with open(exmp2, "w") as writefile:
    writefile.write("This is Line A\n")
    writefile.write("This is Line B\n")

In [29]:
# Check if the lines were written
with open(exmp2, "r") as testwritefile:
    print(testwritefile.read())


This is Line A
This is Line B



In [30]:
# We can create a list and write its elements to a file:
Lines = ["This is line A\n", "This is line B\n", "This is line C\n"]
Lines

['This is line A\n', 'This is line B\n', 'This is line C\n']

In [31]:
# Write the strings in the list to text file
with open('Example2.txt', 'w') as writefile:
    for line in Lines:
        print(line)
        writefile.write(line)

This is line A

This is line B

This is line C



Note that setting the mode to **w** overwrites all the existing data in the file.

In [33]:
with open('Example2.txt', 'w') as writefile:
    writefile.write("Overwrite\n")
    
with open('Example2.txt', 'r') as testwritefile:
    print(testwritefile.read())

Overwrite



**Appending Files**

We can write to a file without losing any of its existing data by setting the mode argument to append: **a**

In [34]:
# Append new lines to text file
with open('Example2.txt', 'a') as testwritefile:
    testwritefile.write("This is Line C\n")
    testwritefile.write("This is Line D\n")
    testwritefile.write("This is Line E\n")

# Verify if the new line is in the text file
with open('Example2.txt', 'r') as testwritefile:
    print(testwritefile.read())

Overwrite
This is Line C
This is Line D
This is Line E



**Additional modes**

We can access the file in the following modes:
- **r+** : Reading and writing. The file must already exist and is not truncated when opened.
- **w+** : Writing and reading. Truncates the file, or creates it if none exists.
- **a+** : Appending and reading. Creates a new file if none exists.

In [1]:
# Using a+ mode:
with open('Example2.txt', 'a+') as testwritefile:
    testwritefile.write("This is Line E\n")
    print(testwritefile.read())




In the previous cell, .read() did not output anything because of our location in the file. Most of the file methods we have looked at work at a certain location in the file: .write() writes at that location, .read() reads from it, and so on.

Opening the file in **w** is akin to opening the .txt file, moving your cursor to the beginning of the text file, writing new text and deleting everything that follows. Opening the file in **a** is similar to opening the .txt file, moving your cursor to the very end and then adding the new pieces of text.
It is often very useful to know where the 'cursor' is in a file and to be able to control it.
- .tell() - returns the current position in bytes
- .seek(offset, from) - changes the position by 'offset' bytes with respect to 'from'. 'from' can take the values 0, 1 or 2, corresponding to the beginning of the file, the current position and the end of the file. (Python's documentation calls this second argument whence.)

In [2]:
# Example with a+
with open('Example2.txt', 'a+') as testwritefile:
    print("Initial Location: {}".format(testwritefile.tell()))
    
    data = testwritefile.read()
    if (not data):  # empty strings are falsy in Python
        print('Read nothing')
    else:
        print(data)  # print what was already read; a second read() would return ''
            
    testwritefile.seek(0,0) # move 0 bytes from beginning.
    
    print("\nNew Location : {}".format(testwritefile.tell()))
    data = testwritefile.read()
    if (not data):
        print('Read nothing')
    else:
        print(data)
    
    print("Location after read: {}".format(testwritefile.tell()) )

Initial Location: 70
Read nothing

New Location : 0
Overwrite
This is Line C
This is Line D
This is Line E
This is Line E

Location after read: 70
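
The same cursor behavior can be sketched with **w+**, which allows reading back what was just written once the position is moved to the beginning. The scratch file and variable names below are assumptions for illustration.

```python
import os
import tempfile

# Hypothetical scratch file so that Example2.txt is left untouched
path = os.path.join(tempfile.mkdtemp(), "wplus_demo.txt")

with open(path, "w+") as demo_file:
    demo_file.write("This is Line A\n")

    # After writing, the cursor sits at the end, so there is nothing left to read
    nothing = demo_file.read()

    # Move back to the beginning; the default second argument of seek() is 0
    demo_file.seek(0)
    everything = demo_file.read()

print(repr(nothing))
print(repr(everything))
```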


**Copy a File**

In [3]:
# Copy Example2 to new file Example3
with open("Example2.txt", "r") as readfile:
    with open("Example3.txt", "w") as testwritefile:
        for line in readfile:
            testwritefile.write(line)

In [4]:
# Check if file was copied
with open("Example3.txt", "r") as testreadfile:
    print(testreadfile.read())

Overwrite
This is Line C
This is Line D
This is Line E
This is Line E



We can also write data to files and save them in different file formats, such as .txt, .csv and .xls (for Excel files). You will come across these in later examples.
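
As one possible illustration of saving in a different format, the sketch below writes a few rows to a .csv file with Python's standard csv module and reads them back. The file name and the rows are assumptions made for this example; the notebook itself does not use the csv module.

```python
import csv
import os
import tempfile

# Hypothetical rows: a header followed by three records
rows = [["a", "b"], [1, 1], [2, 1], [1, 1]]
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# newline="" is what the csv module documentation recommends for file objects
with open(path, "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerows(rows)

with open(path, "r", newline="") as csv_file:
    read_back = list(csv.reader(csv_file))

# Values come back as strings because CSV stores plain text
print(read_back)
```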

## Loading Data with Pandas

Import the Pandas library to start working

In [15]:
import pandas as pd

#create a data frame 
df = pd.DataFrame({"a": [1,2,1], "b": [1,1,1]})
df

|   | a | b |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 2 | 1 |
| 2 | 1 | 1 |


In [16]:
# find the unique values in column "a":

df["a"].unique()

array([1, 2])

In [17]:
# Return a dataframe with only the rows where column  a  is less than two:
df[df["a"]<2]

|   | a | b |
|---|---|---|
| 0 | 1 | 1 |
| 2 | 1 | 1 |
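
The filter above works in two steps: the comparison produces a Series of booleans, and indexing the data frame with that Series keeps only the rows marked True. A minimal sketch, assuming pandas is installed; the names df_demo, mask and filtered are illustrative.

```python
import pandas as pd

# Same small data frame as in the cell above, under a hypothetical name
df_demo = pd.DataFrame({"a": [1, 2, 1], "b": [1, 1, 1]})

# Step 1: the comparison yields one boolean per row
mask = df_demo["a"] < 2
print(mask.tolist())

# Step 2: indexing with the mask keeps the rows where it is True
filtered = df_demo[mask]
print(filtered.index.tolist())
```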


Read data from a csv file using .read_csv

In [19]:
# Read the csv file in our local directory
df = pd.read_csv("TopSellingAlbums.csv")

#print the first 5 rows:
df.head()

|   | Artist | Album | Released | Length | Genre | Music Recording Sales (millions) | Claimed Sales (millions) | Released.1 | Soundtrack | Rating |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Michael Jackson | Thriller | 1982 | 0:42:19 | pop, rock, R&B | 46.0 | 65 | 30-Nov-82 |  | 10.0 |
| 1 | AC/DC | Back in Black | 1980 | 0:42:11 | hard rock | 26.1 | 50 | 25-Jul-80 |  | 9.5 |
| 2 | Pink Floyd | The Dark Side of the Moon | 1973 | 0:42:49 | progressive rock | 24.2 | 45 | 01-Mar-73 |  | 9.0 |
| 3 | Whitney Houston | The Bodyguard | 1992 | 0:57:44 | R&B, soul, pop | 27.4 | 44 | 17-Nov-92 | Y | 8.5 |
| 4 | Meat Loaf | Bat Out of Hell | 1977 | 0:46:33 | hard rock, progressive rock | 20.6 | 43 | 21-Oct-77 |  | 8.0 |


In [21]:
# Access the column 'Length'

df[["Length"]] # use double brackets to output a dataframe

|   | Length |
|---|---|
| 0 | 0:42:19 |
| 1 | 0:42:11 |
| 2 | 0:42:49 |
| 3 | 0:57:44 |
| 4 | 0:46:33 |
| 5 | 0:43:08 |
| 6 | 1:15:54 |
| 7 | 0:40:01 |


In [22]:
# To output a column as a series, use one bracket:

df["Length"]

0    0:42:19
1    0:42:11
2    0:42:49
3    0:57:44
4    0:46:33
5    0:43:08
6    1:15:54
7    0:40:01
Name: Length, dtype: object
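
The difference between single and double brackets can be confirmed by checking the type and shape of each result. A minimal sketch, assuming pandas is installed; df_demo is a hypothetical two-row frame, not the album data set.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the album data
df_demo = pd.DataFrame({"Length": ["0:42:19", "0:42:11"],
                        "Genre": ["pop, rock, R&B", "hard rock"]})

as_series = df_demo["Length"]    # single brackets: one-dimensional Series
as_frame = df_demo[["Length"]]   # double brackets: two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```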

In [23]:
# Access multiple columns:

Y = df[["Artist", "Album", "Genre"]]
Y

|   | Artist | Album | Genre |
|---|---|---|---|
| 0 | Michael Jackson | Thriller | pop, rock, R&B |
| 1 | AC/DC | Back in Black | hard rock |
| 2 | Pink Floyd | The Dark Side of the Moon | progressive rock |
| 3 | Whitney Houston | The Bodyguard | R&B, soul, pop |
| 4 | Meat Loaf | Bat Out of Hell | hard rock, progressive rock |
| 5 | Eagles | Their Greatest Hits (1971-1975) | rock, soft rock, folk rock |
| 6 | Bee Gees | Saturday Night Fever | disco |
| 7 | Fleetwood Mac | Rumours | soft rock |


The **iloc** indexer is used to access individual elements by their integer position (row number, column number)

In [24]:
# Get the element in the first row and first column:

df.iloc[0,0]

'Michael Jackson'

In [25]:
# Access the value on the second row and the third column:

df.iloc[1,2]

1980

The **loc** indexer is used to access individual elements by their row label and column name

In [26]:
# Access the value in the row labeled 1 of the column "Genre"

df.loc[1, "Genre"]

'hard rock'

In [27]:
df.loc[0,"Artist"]

'Michael Jackson'

You can slice by integer position with iloc or by label with loc. Note that an iloc slice excludes its end position, while a loc slice includes its end label:

In [28]:
df.iloc[0:3,0:3]

|   | Artist | Album | Released |
|---|---|---|---|
| 0 | Michael Jackson | Thriller | 1982 |
| 1 | AC/DC | Back in Black | 1980 |
| 2 | Pink Floyd | The Dark Side of the Moon | 1973 |


In [29]:
df.loc[0:3, "Artist":"Released"]

|   | Artist | Album | Released |
|---|---|---|---|
| 0 | Michael Jackson | Thriller | 1982 |
| 1 | AC/DC | Back in Black | 1980 |
| 2 | Pink Floyd | The Dark Side of the Moon | 1973 |
| 3 | Whitney Houston | The Bodyguard | 1992 |
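
The two outputs above differ by one row because iloc slices stop before the end position, like ordinary Python slices, while loc slices include the end label. A minimal sketch, assuming pandas is installed; df_demo is a hypothetical frame built from the first four rows shown earlier.

```python
import pandas as pd

# Hypothetical frame holding the first four albums from the table above
df_demo = pd.DataFrame({
    "Artist": ["Michael Jackson", "AC/DC", "Pink Floyd", "Whitney Houston"],
    "Album": ["Thriller", "Back in Black",
              "The Dark Side of the Moon", "The Bodyguard"],
    "Released": [1982, 1980, 1973, 1992],
})

by_position = df_demo.iloc[0:3, 0:2]           # rows 0-2, columns 0-1
by_label = df_demo.loc[0:3, "Artist":"Album"]  # rows 0-3, Artist through Album

print(by_position.shape)
print(by_label.shape)
```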


We can change the index values by assigning to the **.index** attribute

In [31]:
# our data frame has 8 rows
# convert the indexes to letters

new_index = ['a','b','c','d','e','f','g','h']  # we use a list

df_new = df.copy()  # work on a copy so that df keeps its original numeric index
df_new.index = new_index
df_new

|   | Artist | Album | Released | Length | Genre | Music Recording Sales (millions) | Claimed Sales (millions) | Released.1 | Soundtrack | Rating |
|---|---|---|---|---|---|---|---|---|---|---|
| a | Michael Jackson | Thriller | 1982 | 0:42:19 | pop, rock, R&B | 46.0 | 65 | 30-Nov-82 |  | 10.0 |
| b | AC/DC | Back in Black | 1980 | 0:42:11 | hard rock | 26.1 | 50 | 25-Jul-80 |  | 9.5 |
| c | Pink Floyd | The Dark Side of the Moon | 1973 | 0:42:49 | progressive rock | 24.2 | 45 | 01-Mar-73 |  | 9.0 |
| d | Whitney Houston | The Bodyguard | 1992 | 0:57:44 | R&B, soul, pop | 27.4 | 44 | 17-Nov-92 | Y | 8.5 |
| e | Meat Loaf | Bat Out of Hell | 1977 | 0:46:33 | hard rock, progressive rock | 20.6 | 43 | 21-Oct-77 |  | 8.0 |
| f | Eagles | Their Greatest Hits (1971-1975) | 1976 | 0:43:08 | rock, soft rock, folk rock | 32.2 | 42 | 17-Feb-76 |  | 7.5 |
| g | Bee Gees | Saturday Night Fever | 1977 | 1:15:54 | disco | 20.6 | 40 | 15-Nov-77 | Y | 7.0 |
| h | Fleetwood Mac | Rumours | 1977 | 0:40:01 | soft rock | 27.9 | 40 | 04-Feb-77 |  | 6.5 |


In [35]:
# Access an element using the new index
df_new.loc["a","Album"]

'Thriller'
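
A side note on assignment when changing an index: binding a data frame to a second name does not duplicate it, so an index set through either name shows up under both, whereas .copy() gives an independent frame. A minimal sketch, assuming pandas is installed; all names below are hypothetical and the frames are tiny stand-ins for the album data.

```python
import pandas as pd

# Case 1: plain assignment, both names refer to the same object
shared = pd.DataFrame({"Album": ["Thriller", "Back in Black"]})
alias = shared
alias.index = ["a", "b"]
shared_index = shared.index.tolist()
print(shared_index)  # the original frame now carries the letter index too

# Case 2: an explicit copy leaves the original frame untouched
original = pd.DataFrame({"Album": ["Thriller", "Back in Black"]})
independent = original.copy()
independent.index = ["a", "b"]
original_index = original.index.tolist()
print(original_index)
print(independent.loc["a", "Album"])
```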