In [1]:
import numpy as np
import time

### In this exercise we expect you to demonstrate your ability in / knowledge of:
- Optimization
- Error debugging
- Performance improvement
- OOP
- GitHub Actions

### Improve the efficiency of the following code

In [2]:
starttime1=time.time()
total = 0
for i in np.arange(100000):
    total = i + total
endtime1=time.time()
f_time=endtime1-starttime1
print(total)
print(f'for loop method took {f_time}')

704982704
for loop method took 0.028990983963012695




Optimized Code

In [3]:
# np.sum runs the summation in compiled, vectorized code instead of a Python loop

starttime1=time.time()

n = 100000
total_v2 = np.sum(np.arange(n))
endtime1=time.time()

f_time=endtime1-starttime1
print(f'np.sum method took {f_time}')
print(total_v2)

np.sum method took 0.01201486587524414
704982704
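
The printed total, 704982704, is not the true sum of 0 through 99999. A likely explanation, assuming the notebook ran on a platform where NumPy's default integer was 32 bits wide, is that both the loop and `np.sum` wrapped around modulo 2^32. The sketch below checks that arithmetic and requests 64-bit elements explicitly; the variable names are illustrative only.

```python
import numpy as np

n = 100000

# Closed-form sum of 0..n-1, computed with Python's arbitrary-precision ints.
exact = n * (n - 1) // 2

# Ask for 64-bit elements explicitly so the result does not depend on the
# platform's default integer width.
total_64 = int(np.sum(np.arange(n, dtype=np.int64)))

# The value printed in the notebook equals the exact sum modulo 2**32.
wrapped = exact % 2**32

print(exact, total_64, wrapped)
```

Passing `dtype=np.int64` to `np.arange` is one way to make the result portable; a correctness check belongs next to any timing comparison, since a faster wrong answer is not an optimization.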


### Identify the code issue

In [4]:
import pandas as pd
import io
import requests
url="https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
s=requests.get(url).content
c=pd.read_csv(io.StringIO(s.decode('utf-8')))

In [5]:
# What is wrong with the following code? The goal is to create a new column and set its value to 10 for 'Paraguay'.
c[c.Country=='Paraguay']['new_col']=10


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  c[c.Country=='Paraguay']['new_col']=10


Solution Below

In [6]:
# Fix: a single .loc call creates new_col, sets it to 10 in rows where Country is 'Paraguay', and leaves nulls (NaN) in all other rows
c.loc[c['Country'] == 'Paraguay', 'new_col'] = 10

In [7]:
c[c.Country=='Paraguay'].head()

Unnamed: 0,Country,Region,new_col
189,Paraguay,SOUTH AMERICA,10.0


In [8]:
c.head()

Unnamed: 0,Country,Region,new_col
0,Algeria,AFRICA,
1,Angola,AFRICA,
2,Benin,AFRICA,
3,Botswana,AFRICA,
4,Burkina,AFRICA,
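
The difference between the chained assignment and the `.loc` fix can be reproduced without downloading the CSV. This is a minimal sketch on a small, hypothetical stand-in table (the three rows are made up for illustration); the `.copy()` call only makes explicit that boolean indexing hands back a separate object.

```python
import pandas as pd

# Hypothetical stand-in for the countries table, so the example runs offline.
df = pd.DataFrame(
    {
        "Country": ["Paraguay", "Peru", "Benin"],
        "Region": ["SOUTH AMERICA", "SOUTH AMERICA", "AFRICA"],
    }
)
mask = df["Country"] == "Paraguay"

# Boolean indexing builds a new DataFrame; writing to it leaves df untouched.
subset = df[mask].copy()
subset["new_col"] = 10
chained_changed_df = "new_col" in df.columns

# A single .loc call selects rows and column together and writes into df.
df.loc[mask, "new_col"] = 10

print(chained_changed_df)
print(df)
```

The first assignment lands on `subset` only, which is why the original two-step indexing had no effect on `c`. The `.loc` form performs the selection and the write in one operation on the original frame.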


### Parallel processing

Normalize each row of a 2D array (a list of lists) so that its values range from 0 to 1.

Make sure the code can run on multiple CPUs.

Input: [[2, 3, 4, 5], [6, 9, 10, 12], [11, 12, 13, 14], [21, 24, 25, 26]]

Output: [[0.0, 0.3333333333333333, 0.6666666666666666, 1.0], [0.0, 0.5, 0.6666666666666666, 1.0], [0.0, 0.3333333333333333, 0.6666666666666666, 1.0], [0.0, 0.6, 0.8, 1.0]]

Solution

In [9]:
import numpy as np

# Sample 2D array of lists
data = np.array([[2, 3, 4, 5], [6, 9, 10, 12], [11, 12, 13, 14], [21, 24, 25, 26]])

# Min-Max scaling to normalize the data between 0 and 1
min_vals = np.min(data, axis=1, keepdims=True)
max_vals = np.max(data, axis=1, keepdims=True)

normalized_data = (data - min_vals) / (max_vals - min_vals)

print(normalized_data.tolist())

[[0.0, 0.3333333333333333, 0.6666666666666666, 1.0], [0.0, 0.5, 0.6666666666666666, 1.0], [0.0, 0.3333333333333333, 0.6666666666666666, 1.0], [0.0, 0.6, 0.8, 1.0]]
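
The formula `(x - min) / (max - min)` divides by zero when every value in a row is the same. The sample input has no such row, but a minimal sketch of one way to guard against it follows. The function name `min_max_rows` and the choice to map constant rows to all zeros are assumptions made for illustration, not part of the exercise.

```python
import numpy as np


def min_max_rows(data):
    """Scale each row to [0, 1]; rows with no spread map to all zeros."""
    arr = np.asarray(data, dtype=float)
    mins = arr.min(axis=1, keepdims=True)
    spans = arr.max(axis=1, keepdims=True) - mins
    # Replace zero spans with 1 so constant rows divide cleanly to 0.0.
    safe_spans = np.where(spans == 0, 1.0, spans)
    return (arr - mins) / safe_spans


result = min_max_rows([[2, 3, 4, 5], [7, 7, 7, 7]])
print(result.tolist())
```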


With Multiple CPUs

In [20]:
data = [[2, 3, 4, 5], [6, 9, 10, 12], [11, 12, 13, 14], [21, 24, 25, 26]]

import multiprocessing as mp

def normalize_row(row):
    # Each worker receives one plain Python list, so the built-in min/max
    # apply directly (they accept no axis or keepdims arguments).
    min_val = min(row)
    max_val = max(row)
    return [(val - min_val) / (max_val - min_val) for val in row]

# Normalize each row of the 2D array between 0 and 1 using multiple CPUs
def normalize_2d_array(data):
    num_processes = mp.cpu_count()
    # The context manager shuts the pool down once map has returned.
    with mp.Pool(processes=num_processes) as pool:
        normalized_data = pool.map(normalize_row, data)
    return normalized_data

In [21]:
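# On platforms that start workers with "spawn" (Windows, macOS), run this from
# a script behind the guard below so the workers can import normalize_row.
if __name__ == "__main__":
    output_data = normalize_2d_array(data)

    print(output_data)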

### OOP and GitHub Actions

The table structure is as follows:

CREATE TABLE author(
    A_ID int NOT NULL,
    Name varchar(100),
    PRIMARY KEY(A_ID)
);

CREATE TABLE books(
    B_ID int NOT NULL PRIMARY KEY,
    Name varchar(100),
    Price int NOT NULL,
    A_ID int FOREIGN KEY REFERENCES author(A_ID)
);


- The program should read the DDL statements and parse the column names, data types, column lengths, and constraints.

- The program should generate the required number of rows for the parent and child tables (the row count can be taken as a parameter).

- The generated data for both tables should follow the normalization rules, the foreign key constraints, and the data types.

- Write appropriate test cases with a Python test framework (unittest or pytest).

- Code should be committed to Git only when all the test cases pass (use GitHub Actions); this step acts as the CI/CD step.

- The code should use OOP concepts and follow PEP 8 standards, with proper documentation.

- The result data should be saved in Parquet format.
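
The first three requirements above can be sketched with the standard library alone. This is an illustrative outline, not the contents of the attached solution: the names `Column`, `split_top_level`, `parse_ddl`, and `generate_rows` are hypothetical, the parser assumes the simple DDL shape shown above (table-level `PRIMARY KEY` listed after the columns it names), and the value ranges are arbitrary.

```python
import random
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Column:
    name: str
    dtype: str
    length: Optional[int] = None
    not_null: bool = False
    primary_key: bool = False
    references: Optional[Tuple[str, str]] = None  # (parent table, column)


def split_top_level(body):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, ""
    for ch in body:
        if ch == "," and depth == 0:
            parts.append(current.strip())
            current = ""
            continue
        depth += (ch == "(") - (ch == ")")
        current += ch
    if current.strip():
        parts.append(current.strip())
    return parts


def parse_ddl(ddl):
    """Return {table name: [Column, ...]} for each CREATE TABLE statement."""
    tables = {}
    for chunk in re.split(r"CREATE\s+TABLE", ddl, flags=re.I)[1:]:
        name = chunk[: chunk.index("(")].strip()
        body = chunk[chunk.index("(") + 1 : chunk.rindex(")")]
        columns = []
        for item in split_top_level(body):
            key = re.match(r"PRIMARY\s+KEY\s*\(([^)]*)\)", item, flags=re.I)
            if key:  # table-level constraint, e.g. PRIMARY KEY(A_ID)
                pk_names = {n.strip() for n in key.group(1).split(",")}
                for col in columns:
                    col.primary_key = col.primary_key or col.name in pk_names
                continue
            m = re.match(r"(\w+)\s+(\w+)(?:\((\d+)\))?(.*)", item, flags=re.S)
            rest = m.group(4)
            fk = re.search(r"REFERENCES\s+(\w+)\s*\((\w+)\)", rest, flags=re.I)
            columns.append(Column(
                name=m.group(1),
                dtype=m.group(2).lower(),
                length=int(m.group(3)) if m.group(3) else None,
                not_null="NOT NULL" in rest.upper(),
                primary_key="PRIMARY KEY" in rest.upper(),
                references=(fk.group(1), fk.group(2)) if fk else None,
            ))
        tables[name] = columns
    return tables


def generate_rows(columns, n_rows, parent_keys=None, seed=0):
    """Build n_rows dicts; foreign keys are drawn from parent_keys."""
    rng = random.Random(seed)
    rows = []
    for i in range(1, n_rows + 1):
        row = {}
        for col in columns:
            if col.references:
                row[col.name] = rng.choice(parent_keys)
            elif col.primary_key:
                row[col.name] = i  # sequential ids keep the key unique
            elif col.dtype == "int":
                row[col.name] = rng.randint(1, 1000)
            else:
                row[col.name] = f"{col.name}_{i}"[: col.length]
        rows.append(row)
    return rows


DDL = """
CREATE TABLE author(A_ID int NOT NULL, Name varchar(100), PRIMARY KEY(A_ID ))
CREATE TABLE books(B_ID int NOT NULL PRIMARY KEY, Name varchar(100),
    Price int NOT NULL, A_ID int FOREIGN KEY REFERENCES author(A_ID));
"""
schema = parse_ddl(DDL)
authors = generate_rows(schema["author"], 3)
author_ids = [a["A_ID"] for a in authors]
books = generate_rows(schema["books"], 5, parent_keys=author_ids, seed=1)
print(schema["books"])
print(books[0])
```

Generating the parent table first and passing its keys to the child generator is what keeps every `books.A_ID` valid. For the Parquet requirement, the row dictionaries could be loaded into a pandas DataFrame and written with `DataFrame.to_parquet`, which needs a Parquet engine such as pyarrow installed.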

Solution below:

I have attached the following three files to the email along with this Jupyter notebook. Each is described below.

1) ds_exercise_final.py - final code to generate the tables, add data, get data, and write the result in Parquet format.
2) test_file.py - pytest test cases.
3) pythonapp.yml - YAML workflow file for GitHub Actions that checks whether the code is good to commit (tested on GitHub).
