# Load Data
Run this code block to generate the data files.

In [3]:
import numpy as np
import pandas as pd
x = pd.DataFrame({
    "reaction":["rct1","rct2","rct3","rct4","rct5"],
    "delta_G(kcal/mol)":[-41.22,-33.10,10.111,-31.4152,-8.111],
    "Scientist_name":["Ben","Ben","Steven","Ben","Steven"]
})
x.to_csv("reaction_data.csv")

def gen_rand_file(file_number:int):
    file_name = "large_data"+str(file_number)+".txt"
    x = np.random.randint(1,1000001,10000)
    x = list(x)
    x = [str(i)+"\n" for i in x]
    with open(file_name,"w") as file:
        file.writelines(x)

for i in range(1,11):
    gen_rand_file(i)


# Loops

You need a loop whenever you want to perform the same action repeatedly, either a set number of times or once for each object in some collection.

## for loops

For loops are a good fit when you want to repeat a task and you know exactly how many times. For example, let's say we wanted to print a list of 5 file versions. We could write the code below.

In [58]:

print("file_name_V1.txt")
print("file_name_V2.txt")
print("file_name_V3.txt")
print("file_name_V4.txt")
print("file_name_V5.txt")

file_name_V1.txt
file_name_V2.txt
file_name_V3.txt
file_name_V4.txt
file_name_V5.txt


This isn't terrible, but it doesn't scale well. We can recognize the pattern and write a loop instead:

In [59]:
MIN = 1
MAX = 10
for i in range(MIN,MAX+1):
    print("file_name_V"+str(i)+".txt")

file_name_V1.txt
file_name_V2.txt
file_name_V3.txt
file_name_V4.txt
file_name_V5.txt
file_name_V6.txt
file_name_V7.txt
file_name_V8.txt
file_name_V9.txt
file_name_V10.txt


Not only is this fewer lines of code for more results, but we can also easily change MIN and MAX later.
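
If the `MAX+1` in the loop above looks odd: `range` stops one short of its second argument. Here is a small sketch that shows this and builds the same names with an f-string instead of `+` and `str()`. The smaller MAX and the `names` list are only there for illustration.

```python
MIN = 1
MAX = 3

# range(MIN, MAX + 1) yields MIN, ..., MAX; without the +1 it would stop at MAX - 1
print(list(range(MIN, MAX + 1)))  # [1, 2, 3]

# collect the names in a list so they can be reused later instead of only printed
names = []
for i in range(MIN, MAX + 1):
    names.append(f"file_name_V{i}.txt")

for name in names:
    print(name)
```

Collecting the names in a list is optional. It just makes it easy to pass them to other code later.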

For loops are also good when you want to iterate through something of a fixed length. Below is an example list of numbers. Let's pretend these are temperatures of a system in kelvin and we want to print them in degrees Celsius.

In [60]:
numbers  = [9259.17, 706.09, 2518.31, 6829.44, 4464.02, 697.74, 8215.01]

print("measurement 1:",numbers[0]-273.15,"C")
print("measurement 2:",numbers[1]-273.15,"C")
print("measurement 3:",numbers[2]-273.15,"C")
print("measurement 4:",numbers[3]-273.15,"C")
# and then the rest of them

measurement 1: 8986.02 C
measurement 2: 432.94000000000005 C
measurement 3: 2245.16 C
measurement 4: 6556.29 C


Again, this works, but it doesn't scale well, and you risk mistyping a conversion or having to hunt down a typo. Let's try again with a for loop.

In [61]:
numbers  = [9259.17, 706.09, 2518.31, 6829.44, 4464.02, 697.74, 8215.01]

# I'm doing some slightly fancy stuff with a format string here. You can achieve the output format you want
# in many different ways, but I thought I might introduce it here. The :.2f rounds the value to two decimal places.
template_string = "measurement {i}: {v:.2f} C"
for index, value in enumerate(numbers):
    formatted_string = template_string.format(i=index+1,v=value-273.15)
    print(formatted_string)

measurement 1: 8986.02 C
measurement 2: 432.94 C
measurement 3: 2245.16 C
measurement 4: 6556.29 C
measurement 5: 4190.87 C
measurement 6: 424.59 C
measurement 7: 7941.86 C
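
As a side note, `enumerate` accepts a `start` argument, so the `index+1` above is optional. The sketch below uses a shortened list and an f-string. The names `KELVIN_OFFSET` and `celsius` are my own for this example.

```python
numbers = [9259.17, 706.09, 2518.31]
KELVIN_OFFSET = 273.15  # subtract this to go from kelvin to degrees Celsius

celsius = []
# start=1 makes the counter begin at 1 instead of 0
for index, value in enumerate(numbers, start=1):
    converted = value - KELVIN_OFFSET
    celsius.append(converted)
    print(f"measurement {index}: {converted:.2f} C")
```

Keeping the converted values in a list means you can reuse them without redoing the subtraction.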


## Loop through files

It is common to have large sets of data. Let's say Shubham sends you a file titled "large_data1.txt" (you can see it here in this folder). He says, "Hey, can you let me know how many times a number less than or equal to 532 occurs in this data? THANKS!" This file is 10,000 data points long. Going through it manually would be awful. Luckily, this is a short for loop.

In [62]:
cutoff = 532

# Here I am opening the file and then storing the contents in a list.
with open("large_data1.txt") as file:
    large_data = file.readlines()

# this uses a list comprehension to make sure the items are integers, not strings
large_data = [int(i) for i in large_data]

# we need something to count instances
counter = 0

# finally the loop

for i in large_data:
    if i <= cutoff:
        counter = counter + 1


# print out our result
print(counter)

# maybe even format it nicely
print("large_data1.txt contains", counter, "integers less than or equal to", cutoff, end=".")

8
large_data1.txt contains 8 integers less than or equal to 532.
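
One possible variation, assuming you don't need the whole list afterwards: loop over the file object directly, so each line is converted and checked as it is read. To keep the sketch runnable on its own, it writes a tiny stand-in file (`demo_data.txt`, a made-up name) to a temporary folder instead of using large_data1.txt.

```python
import os
import tempfile

# write a tiny stand-in file so the example does not depend on large_data1.txt
demo_values = [12, 532, 533, 7, 900000]
demo_path = os.path.join(tempfile.mkdtemp(), "demo_data.txt")
with open(demo_path, "w") as file:
    file.writelines(f"{v}\n" for v in demo_values)

cutoff = 532
counter = 0

# looping over the file object gives one line at a time
with open(demo_path) as file:
    for line in file:
        # int() ignores the trailing newline
        if int(line) <= cutoff:
            counter += 1

print(counter)  # 3 (the values 12, 532 and 7)
```

For a 10,000 line file either approach is fine. Reading line by line mostly matters once the file is too big to hold comfortably in memory.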

# Functions

This is great code, but let's say you need to write it multiple times. You keep performing this same task everywhere in your code. It would be nice if you could write it once and never have to think about it again, or, if it wasn't working, fix it in just one spot. What you are describing is a function.

Let's take our last example, where we counted instances of a number less than or equal to a cutoff in a large dataset. What if Shubham gave us more data, or changed his mind about the cutoff? It would be nice to not have to go back and remember how our code works or find the spots to edit. Let's write a function that takes a file name and a cutoff and returns the number of instances less than or equal to that cutoff.


In [63]:
# we use the keyword `def` to define a function
# function names follow the same rules as variable names.
# immediately after the function name come parentheses; inside those parentheses we specify the arguments.
# from our problem statement we know we will take a file_name and a cutoff.
def less_than_cutoff(file_name,cutoff):
    counter = 0
    # all this code should look very similar to before, but we are using the arguments that will be passed in as variables in this scope
    with open(file_name,"r") as file:
        large_data = file.readlines()
    large_data = [int(i) for i in large_data]

    for i in large_data:
        if i <= cutoff:
            counter = counter + 1

    # Here we use the return keyword to specify what the function will output. I want my output to be as generally usable as possible.
    return counter

    

Now we can simply call our function on a data set, set the cutoff, and get back to Shubham very quickly:

In [64]:
print(less_than_cutoff("large_data1.txt",532))

8
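
A quick way to build trust in a function like this is to run it on a tiny file where you can count the answer by hand. The sketch below repeats the function so that it runs on its own. The file `check_data.txt`, its contents, and the `results` dictionary are made up for this check.

```python
import os
import tempfile

def less_than_cutoff(file_name, cutoff):
    """Count the integers in file_name that are less than or equal to cutoff."""
    counter = 0
    with open(file_name, "r") as file:
        large_data = file.readlines()
    large_data = [int(i) for i in large_data]

    for i in large_data:
        if i <= cutoff:
            counter = counter + 1
    return counter

# a small file whose answers are easy to count by hand
check_path = os.path.join(tempfile.mkdtemp(), "check_data.txt")
with open(check_path, "w") as file:
    file.write("1\n5\n10\n50\n100\n")

# the same function handles any cutoff without editing its code
results = {}
for cutoff in [0, 10, 100]:
    results[cutoff] = less_than_cutoff(check_path, cutoff)

print(results)  # {0: 0, 10: 3, 100: 5}
```

Because the function returns a plain number, the loop can store each result instead of only printing it.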


Functions can be called inside other functions. This is why we wanted our output to be general earlier. We can make a function that returns a string formatted to our liking.

In [65]:
def report_ltc(file_name,cutoff):
    occurrences = less_than_cutoff(file_name,cutoff)

    return f"{file_name} contains {occurrences} integers less than or equal to {cutoff}."

Let's use our new report_ltc function.

In [66]:
print(report_ltc("large_data6.txt",32345))

large_data6.txt contains 301 integers less than or equal to 32345.


Pretty nifty. Now let's see if we can loop through all the files using the glob module and get reports for all of them. Note that glob returns the matching files in no particular order.

In [67]:
import glob

for file in glob.glob("*.txt"):
    print(report_ltc(file,495837))

large_data3.txt contains 4957 integers less than or equal to 495837.
large_data9.txt contains 5010 integers less than or equal to 495837.
large_data5.txt contains 5024 integers less than or equal to 495837.
large_data4.txt contains 4971 integers less than or equal to 495837.
large_data6.txt contains 4970 integers less than or equal to 495837.
large_data1.txt contains 4975 integers less than or equal to 495837.
large_data7.txt contains 4917 integers less than or equal to 495837.
large_data2.txt contains 5025 integers less than or equal to 495837.
large_data8.txt contains 5018 integers less than or equal to 495837.
large_data10.txt contains 4979 integers less than or equal to 495837.


If you are a little uncomfortable with glob, here is what the same loop might look like without it.

In [68]:
for i in range(1,11):
    file_name = "large_data" + str(i) + ".txt"
    print(report_ltc(file_name,209384))

large_data1.txt contains 2081 integers less than or equal to 209384.
large_data2.txt contains 2083 integers less than or equal to 209384.
large_data3.txt contains 2081 integers less than or equal to 209384.
large_data4.txt contains 2093 integers less than or equal to 209384.
large_data5.txt contains 2135 integers less than or equal to 209384.
large_data6.txt contains 2059 integers less than or equal to 209384.
large_data7.txt contains 2046 integers less than or equal to 209384.
large_data8.txt contains 2158 integers less than or equal to 209384.
large_data9.txt contains 2119 integers less than or equal to 209384.
large_data10.txt contains 2085 integers less than or equal to 209384.
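
If you want glob's convenience but the numeric order of the range loop, one option is to sort the matches by the number in each file name. This is only a sketch: it creates ten small stand-in files in a temporary folder, and the helper `file_number` is a hypothetical name, not part of the notebook. A plain `sorted()` would compare the names as text and put large_data10.txt before large_data2.txt, which is why the sort key pulls out the number.

```python
import glob
import os
import re
import tempfile

# create ten small stand-in files in a temporary folder
folder = tempfile.mkdtemp()
for i in range(1, 11):
    with open(os.path.join(folder, f"large_data{i}.txt"), "w") as file:
        file.write("1\n")

def file_number(path):
    """Return the integer right before .txt, e.g. 7 for large_data7.txt."""
    match = re.search(r"(\d+)\.txt$", path)
    return int(match.group(1))

# a narrower pattern than *.txt, so unrelated text files are not picked up
pattern = os.path.join(folder, "large_data*.txt")
paths = sorted(glob.glob(pattern), key=file_number)

names = [os.path.basename(p) for p in paths]
for name in names:
    print(name)
```

The sorted list can then be fed to a loop like the ones above, for example by calling report_ltc on each path.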
