# NCRM April 2024 - Intro to Python for Data Analysis
### Session 4 - part 1: Import, functions, and reading/writing files

![image.png](attachment:image.png)

## Lewys Brace
#### l.brace@exeter.ac.uk

## Importing modules

In session 1, we spoke about Python's "batteries included" philosophy, and at the end of the last session we used the following line of code:

In [None]:
import copy

The code line above is referred to as an "import statement" and is used to import modules into your code.

Modules are pieces of code that contain, sometimes many, different functions to carry out specific tasks. As we saw in the last session, the **copy** module contains a function that allows us to create an exact copy of nested data structures: **deepcopy()**.

Modules have many different functions for carrying out specific tasks, meaning that you do not need to create code for every single task you wish to carry out. Instead, you can import pre-built pieces of code, known as packages, to carry out such a task.

Any Python file can technically be referred to as a module. A Python file called **hello.py** has the module name of **hello** that can be imported into other Python files or used on the Python command line interpreter.

Given Python's open-source nature, there are now modules available to do the vast majority of tasks that you are likely to need to do during your career as a computational social scientist.

This is one of the features of Python that make it so popular.

However, before we can use a module's functions, we first need to enable the Python script that we are currently working on to access it. This is where the **import** statement comes in.

For example, the code below imports the **math** module and then uses its **pi** constant to print the value of pi to the screen:

In [None]:
import math
pie = math.pi
print("The value of pi is : ", pie)

When Python imports a module called hello for example, the interpreter will first search for a built-in module called hello. If a built-in module is not found, the Python interpreter will then search for a file named hello.py in a list of directories that it receives from the sys.path variable.
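
The search order described above can be inspected from within Python itself. The following is a minimal sketch that only uses the standard **sys** module; the directories printed will differ from machine to machine.

```python
import sys

# Modules compiled directly into the interpreter are checked first.
# The sys module itself is one of them.
print("sys" in sys.builtin_module_names)

# After that, Python looks through each directory in sys.path, in order.
for directory in sys.path:
    print(directory)
```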

As another example, the code below first imports the **datetime** module so that the current Python script can access its contents. It then calls the **now()** method of the **datetime** class, which lives inside the **datetime** module.

In [None]:
import datetime
current_time = datetime.datetime.now()
print(current_time)

If, however, you intend to use the same function from a specific module multiple times, you can instead use the **from [...] import [...]** statement to import only the specific item you require.

For example, in the **datetime** example above, to get the current date and time (as dictated by our machine's internal clock), we had to type the name of the module (**datetime**), then the class we wanted from that module (also called **datetime**), and then its **now()** method.

If we only want the **datetime** class, however, we can just do the below:

In [None]:
from datetime import datetime
current_time = datetime.now()
print(current_time)

This works for all modules:

In [None]:
from math import pi
print("The value of pi is : ", pi)

It is possible to modify the names of modules and their functions within Python by using the **as** keyword.

You may want to change a name because you have already used the same name for something else in your programme, another module you have imported also uses that name, or you may want to abbreviate a longer name that you are using a lot.

The construction of this statement looks like the following: **import [...] as [...]**
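
As a small illustration of this construction, the sketch below aliases both a whole module and a single function. The alias names **dt** and **square_root** are arbitrary choices made for this example.

```python
# Alias a whole module, and alias a single function from another module.
import datetime as dt
from math import sqrt as square_root

current_time = dt.datetime.now()
print(current_time)

root = square_root(16)
print(root)  # 4.0
```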

There are a number of modules that are very commonly used within the data science community and which are often assigned a shortened alias on import by convention.

For example, one of the modules you will use the most in computational social science is the NumPy module. We'll discuss this in depth in a future session, but for now, it is enough to know that users conventionally import it with the following alias:

In [None]:
import numpy as np

This means that when we want to use functions from the NumPy module, we don't have to type the whole module name as follows:

In [None]:
import numpy
array = numpy.array([1,2,3])

Instead, we can just type "np":

In [None]:
import numpy as np
array = np.array([1,2,3])

In some coding examples online, you might see imports that use the wildcard operator: *
    
For example:

In [None]:
from math import *

Importing this way should be avoided.

This is because wildcard imports make it unclear which names are present in the namespace, confusing both readers and many automated tools.
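
To get a feel for how many names a wildcard import can bring in, the sketch below lists the public names of the **math** module (those that do not start with an underscore). It assumes the module does not restrict what it exports, in which case a wildcard import would add every one of these names to your script at once.

```python
import math

# Collect every name in the math module that does not start with an underscore.
public_names = [name for name in dir(math) if not name.startswith("_")]

print(len(public_names))
print(public_names[:10])
```

If your own script already had a variable called **pi** or **e**, a wildcard import would silently replace it.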

Finally, when importing modules, it is considered best practice to have one import statement per line and to group these together at the top of your Python script; i.e.:

In [None]:
import numpy as np
import math
from datetime import datetime

## Exercise 1
Do the first exercise in the exercises Jupyter notebook for this session.

## Installing modules

When you installed the Anaconda distribution of Python, you also installed the Python Standard Library.

The Python Standard Library contains many modules that provide access to system functionality or provide standardised solutions.

This means that certain modules, such as **math**, can be imported into your Python code right away as they are built-in modules:

In [None]:
import math

However, many modules that you are likely to use are not built-in.

If you try to import a non-built-in module, you'll receive an error message like the one below:

> Output
ModuleNotFoundError: No module named ['module name']

This means that we first need to install this module onto our system before we can import it into our current Python script.
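
If you would like to check whether a module is available before trying to import it, one possible approach is sketched below. The helper name **is_installed** and the made-up module name are assumptions for this example; **importlib.util.find_spec()** is part of the Python Standard Library.

```python
import importlib.util

def is_installed(module_name):
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None

print(is_installed("math"))
print(is_installed("a_module_that_does_not_exist_xyz"))
```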

To do that, we open the Anaconda prompt and use pip install, as we discussed in session 1.

Once we are in the Anaconda prompt window, we then type the following:

> pip install ['module name']

This will then use pip to install the module you named.

For example, pretend you wanted to install the **sqlite3** module (in reality it already ships with the Python Standard Library, so this is purely for illustration). You would open the Anaconda prompt and type the following:


![image.png](attachment:image.png)

The sqlite3 module would then be installed. Once complete, you can then import it into your Python script in the same way as you did the built-in module above.

**Note:** You will only ever have to install a module once on each machine, the first time you wish to use it. Once you have installed it, you can just import it as normal in future projects.

## Custom functions

So far on this course, we have used functions that are in-built into Python, e.g. **print()** and **len()**, and looked at functions that we use from imported modules, e.g. **datetime.datetime.now()**.

As we have seen, the in-built Python functions take an input, perform some operations, and then return an output, e.g.:

In [None]:
my_string = "University of Exeter"
print(len(my_string))

We could also have computed the length of the string without using the in-built function:

In [None]:
my_string = "University of Exeter"
length = 0
for element in my_string:
    length += 1
print(length)

Both methods achieve the same result, but using the in-built **len()** function is less work.

This makes use of an abstraction that is common in modern software, whereby we convert repeated tasks and text into condensed and easy-to-use tools. Indeed, modern software is built upon decades of such abstractions.

Perhaps the most major abstraction concerns how all data stored on computers is in binary form: a series of 0s and 1s. For example, the following phrase:

> Hello, my name is Lewys

Is not actually represented that way within the computer. Instead, it is represented by the following sequence:

> 1001000 1100101 1101100 1101100 1101111 101100 100000 1101101 1111001 100000 1101110 1100001 1101101 1100101 100000 1101001 1110011 100000 1001100 1100101 1110111 1111001 1110011
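
The sequence above can be reproduced in a few lines of Python. This is a minimal sketch that converts each character to its numeric code with **ord()** and then formats that number in binary; the variable names are chosen just for this example.

```python
text = "Hello, my name is Lewys"

# Convert each character to its numeric code, then to a string of 0s and 1s.
binary = " ".join(format(ord(character), "b") for character in text)
print(binary)

# Reverse the process: read each group of bits as a base-2 number.
decoded = "".join(chr(int(bits, 2)) for bits in binary.split(" "))
print(decoded)
```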

When we save a piece of text like this within a document to our computer's HDD, which is divided into billions of tiny magnetic regions, each region has its polarity left or changed to be positive or negative; this corresponds to 1 or 0, respectively.

The same process happens when we communicate via technology:

![image-4.png](attachment:image-4.png)

Our text is converted into binary by whichever device we are using, transmitted, and then converted back on the device we sent it to.

However, we obviously do not use computers by typing our code, documents, or altering images in binary. Instead, we have abstracted this process to save us time and brain power. In the process, we have also made these technologies easier to use; look how quickly you're learning Python due to it being a high-level programming language!

This kind of abstraction is the same logic that underpins in-built or module-imported Python functions.

However, sometimes you will find that you have small tasks or problems that you often have to repeat, and for which there is not an in-built or module-importable function that is applicable.

In these cases, it is worth building our own functions. Indeed, depending on the task at hand, sometimes it is just easier to do this than finding a suitable module to import.

While this sounds difficult, creating a function in Python is actually very easy.

To define a Python function, we begin by typing the keyword **def**, which allows us to define a function. This is followed by the function's name, then parentheses containing any arguments the function will take, and then a **:**

The function below just multiplies two numbers together. It takes two variables as inputs, **a** and **b**. It then assigns the result of the multiplication to the **answer** variable and passes that to the **return** statement, which hands the result back to whoever called the function.

In [None]:
def practice_function(a, b):
    answer = a * b
    return answer

#Create two variables to test our function.
x = 5
y = 4

The calculated variable is the result of multiplying the two numbers together.

The function then takes our two variables (note that the variable names passed in the function call do not need to match those in the function definition; they are matched by position, in sequential order); i.e. x becomes a and y becomes b.

The "answer" that is returned is then assigned to "calculated".

In [None]:
calculated = practice_function(x, y)
print(calculated)

Note: Remember that Python is sequential; this means you have to define your function before you can call it.

For example, the below would not work because Python is reading the "call" statement before it reads the code to define the function:

In [None]:
calculated = practice_function_2(x, y)
print(calculated)

def practice_function_2(a, b):
    answer = a * b
    return answer

#Create two variables to test our function.
x = 5
y = 4
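
To see this failure without stopping your script, you can catch the error that Python raises. The sketch below uses a hypothetical function called **multiply** and a **try/except** block, which we have not covered on this course, so treat it as an illustration only.

```python
try:
    # At this point, multiply has not been defined yet.
    early_result = multiply(5, 4)
except NameError as error:
    print("Calling too early failed:", error)

def multiply(a, b):
    return a * b

# Now that the definition has been read, the call works.
late_result = multiply(5, 4)
print(late_result)  # 20
```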

A function can also have multiple returns.

The only difference is that, when you call the function, you have to provide multiple variable names.

The **return** statement will assign the values fed into it to the variable names in the call statement in sequential order.

In [None]:
def practice_function(a, b):
    answer = a * b
    answer2 = answer * answer
    return answer, answer2

x = 5
y = 4

calculated, calculated2 = practice_function(x, y)
print("First output: ", calculated)
print("Second output: ", calculated2)
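
Behind the scenes, a function with multiple returns hands back a single tuple, which Python then unpacks into the variable names you provide. The sketch below, using a hypothetical function name, shows both steps separately.

```python
def multiply_and_square(a, b):
    product = a * b
    squared = product * product
    return product, squared

# Catch everything in one variable: the result is a tuple.
both = multiply_and_square(5, 4)
print(type(both))
print(both)  # (20, 400)

# Unpack the tuple into two separate variables.
product, squared = both
print(product)  # 20
print(squared)  # 400
```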

User-defined functions can be as simple or as complex as we like, although we should always aim to make them as simple as possible.

The first, and most important, step is to know what problem you are trying to solve.

If you understand the problem, you can then break it down into smaller problems, some of which might be repeated many times.

Rather than writing an enormous block of code that handles everything, you can instead write several sub-functions that individually handle these smaller problems, allowing you to better organise your approach.
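
As a small illustration of this idea, the sketch below splits a simple text-summarising task into two sub-functions and a third function that combines them. The function names and the task itself are made up for this example.

```python
def clean_text(text):
    # Remove surrounding whitespace and convert to lower case.
    return text.strip().lower()

def count_words(text):
    # Split on whitespace and count the resulting pieces.
    return len(text.split())

def summarise_text(text):
    # Combine the two smaller steps into one task.
    cleaned = clean_text(text)
    return cleaned, count_words(cleaned)

cleaned, n_words = summarise_text("  University of Exeter  ")
print(cleaned)   # university of exeter
print(n_words)   # 3
```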

We will be creating and looking at a number of user-defined functions throughout this course.

## Exercise 2
Do the second set of exercises in the Jupyter notebook of exercises for this session.



## Reading and writing files: The file object
File handling in Python can easily be done with the built-in file object.

Before we look at the in-built file object, however, it is worth recapping file paths from session 1.

File paths are crucial for anyone who is working with computer code to understand.

Technically speaking, a file path is a string of characters used to identify a location in a directory structure.

They are composed in such a way that components (folders) are separated between levels by a delimiting character.

The delimiting character is usually a forward slash (**/**) or a backslash (**\\**)

These paths represent directory/file relationships.

In a non-technical sense, they are a list of increasingly nested sub-directories that lead to a particular file or folder.

In Windows, an absolute file path looks similar to the following:

> C:/Users/JohnSmith/Downloads/My_downloaded_file.csv

![image-3.png](attachment:image-3.png)


On a Mac, an absolute path looks like:

> /Users/JohnSmith/Downloads/My_downloaded_file.csv

![image-2.png](attachment:image-2.png)

Notice above, we use the term "absolute file path". Remember from session 1 that the absolute path includes the root directory, and will therefore always point to the same location on a machine:

> Windows:
> C:/Users/JohnSmith/Downloads/My_downloaded_file.csv
	
> Mac:
> /Users/JohnSmith/Downloads/My_downloaded_file.csv

In contrast, a relative path begins from some working directory, and therefore does not require an absolute path:

> Downloads/My_downloaded_file.csv

If the working directory is the folder that contains the file, then the filename itself can constitute a relative path:

> My_downloaded_file.csv
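
The difference between the two kinds of path can be explored with the standard **os** module. The sketch below is illustrative only: the file does not need to exist, and the absolute path printed will depend on your own working directory.

```python
import os

# The current working directory is the starting point for relative paths.
print(os.getcwd())

relative_path = os.path.join("Downloads", "My_downloaded_file.csv")
print(os.path.isabs(relative_path))  # False

# abspath joins the working directory onto the relative path.
absolute_path = os.path.abspath(relative_path)
print(absolute_path)
print(os.path.isabs(absolute_path))  # True
```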


## Exercise

Now that we all have a text file in a known location on our system, we can look at the in-built Python commands for interacting with it.

The file object provides all of the basic functions necessary in order to manipulate files; including reading, appending, and writing to files.

Let's start by looking at how to open a file in Python using the in-built **open()** function. Indeed, before you can work with a file, you have to open it.

The **open()** function takes two arguments; the name of the file that you wish to use and the mode for which we would like to open the file. 

There are a number of different file opening modes. The most common are:

> **'r'** = read

> **'w'** = write

> **'r+'** = both reading and writing

> **'a'** = appending

The **open()** function takes two arguments.

The first is the file name in the form of a string, and this string has to contain two other bits of important information. The first is the file path. Remember, if the file you want to read in is not in your current working directory, then you will need to include the absolute file path to the file in the string (in the examples here, we will be using the absolute path). The second is the file type. Below we are using a text file (**.txt**) in our examples.

The second argument specifies what mode you want to open the file in, which is related to what you want to do with the file. By default, the **open()** function opens a file in ‘read mode’, and this is what the **‘r’** in the second argument position signifies.


In [None]:
test_file = open("C:/username/Documents/data_folder/test_text_file.txt", 'r')

Once you’re done working with a file, you can close it with the .close() function.

Using this function will free up any system resources that are being used up by having the file open and will limit the chances of you accidentally deleting the data contained therein.

In [None]:
test_file.close()
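
Because it is easy to forget to close a file, many Python users open files with the **with** statement, which closes the file automatically once the indented block ends. The sketch below is one way of doing this; it writes to a temporary folder created with the standard **tempfile** module so that it can run on any machine, and the file name and contents are made up for this example.

```python
import os
import tempfile

# Create a throwaway folder so the example does not touch your own files.
folder = tempfile.mkdtemp()
file_path = os.path.join(folder, "test_text_file.txt")

with open(file_path, 'w') as test_file:
    test_file.write("line one\nline two\n")

with open(file_path, 'r') as test_file:
    contents = test_file.read()

# The file was closed automatically when the with block ended.
print(test_file.closed)  # True
print(contents)
```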

Once you have opened a file using **open()**, there are several ways in which to load in the contents of a file.

One way is to iterate through every line in a file by using a for loop.

In [None]:
test_file = open("C:/username/Documents/data_folder/test_text_file.txt", 'r')
for line in test_file:
    print(line)
test_file.close()

Python's file object also has three in-built functions for reading in file contents.

The first is **.read()**, which returns the entire contents of the file as a single string.

In [None]:
test_file = open("C:/username/Documents/data_folder/test_text_file.txt", 'r')
print(test_file.read())
test_file.close()

The second is **.readline()**, which returns one line at a time.

In [None]:
test_file = open("C:/username/Documents/data_folder/test_text_file.txt", 'r')
print(test_file.readline())
test_file.close()

The third is **.readlines()**, which returns a list of lines.

In [None]:
test_file = open("C:/username/Documents/data_folder/test_text_file.txt", 'r')
print(test_file.readlines())
test_file.close()
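
To compare the three reading functions side by side, the sketch below first creates a small three-line file in a temporary folder and then reads it back in each of the three ways. The file name and its contents are assumptions made for this example.

```python
import os
import tempfile

# Create a small file to read from, in a throwaway folder.
file_path = os.path.join(tempfile.mkdtemp(), "test_text_file.txt")
test_file = open(file_path, 'w')
test_file.write("first line\nsecond line\nthird line\n")
test_file.close()

# .read() returns everything as one string.
test_file = open(file_path, 'r')
whole_text = test_file.read()
test_file.close()

# .readline() returns the next line each time it is called.
test_file = open(file_path, 'r')
first = test_file.readline()
second = test_file.readline()
test_file.close()

# .readlines() returns a list with one string per line.
test_file = open(file_path, 'r')
all_lines = test_file.readlines()
test_file.close()

print(repr(whole_text))
print(repr(first), repr(second))
print(all_lines)
```

Note that each line keeps its trailing newline character, which is why printing lines one by one in a for loop leaves a blank line between them.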

Just as there are multiple methods for reading in the contents of a file, there are two similar in-built methods for writing to a file from within your Python script.

The first is **.write()**, which writes a specified sequence of characters to a file. Do note that, when loading this file in, we want to write to it so we replace the **'r'** with a **'w'**.

In [None]:
working_file = open("C:/username/Documents/data_folder/working_file.txt", 'w')
working_file.write("Add this line of text to the file")
working_file.close()

The second is **.writelines()** which writes a list of strings to a file.

In [None]:
test_list = ["text piece 1", "text piece 2", "text piece 3"]

working_file = open("C:/username/Documents/data_folder/working_file.txt", 'w')
working_file.writelines(test_list)
working_file.close()
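
One thing to be aware of is that **.writelines()** does not add line breaks for you, so the strings are written one straight after another. The sketch below demonstrates this using a temporary folder, and then shows one way of adding a newline character to each string first.

```python
import os
import tempfile

test_list = ["text piece 1", "text piece 2", "text piece 3"]
file_path = os.path.join(tempfile.mkdtemp(), "working_file.txt")

# Written as-is, the three strings run together on one line.
working_file = open(file_path, 'w')
working_file.writelines(test_list)
working_file.close()

working_file = open(file_path, 'r')
run_together = working_file.read()
working_file.close()
print(run_together)

# Adding "\n" to each string puts every piece on its own line.
working_file = open(file_path, 'w')
working_file.writelines([piece + "\n" for piece in test_list])
working_file.close()

working_file = open(file_path, 'r')
separate_lines = working_file.read()
working_file.close()
print(separate_lines)
```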

Important: Opening a file in **'w'** mode and then using the **write()** or **writelines()** function will overwrite anything contained within that file, if a file of the same name already exists at that location.

If you do not want to overwrite a file's contents, you can use the append method.

To append to an existing file, simply put **'a'** instead of **'r'** or **'w'** in the **open()** call when opening a file.

In [None]:
working_file = open("C:/username/Documents/data_folder/working_file.txt", 'a')
working_file.write("Add this line of text to the file without overwriting")
working_file.close()
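
The difference between the **'w'** and **'a'** modes can be checked with a short experiment. The sketch below uses a temporary folder and made-up text so that none of your own files are affected.

```python
import os
import tempfile

file_path = os.path.join(tempfile.mkdtemp(), "working_file.txt")

# Create the file with some initial text.
working_file = open(file_path, 'w')
working_file.write("original text\n")
working_file.close()

# Opening in 'w' mode again wipes the existing contents.
working_file = open(file_path, 'w')
working_file.write("replacement text\n")
working_file.close()

# Opening in 'a' mode keeps the contents and adds to the end.
working_file = open(file_path, 'a')
working_file.write("appended text\n")
working_file.close()

working_file = open(file_path, 'r')
final_contents = working_file.read()
working_file.close()
print(final_contents)
```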

## End of session 4 - part 1