#### **Killer functions**

Some functions help build machine learning models that are applicable to real-world problems by providing optimized, efficient, and accurate methods at scale.

##### **Input/Output**

Pandas input and output methods serialize data: Python objects are transformed so that they can be stored or loaded. This can make these operations inefficient and time-consuming.

To overcome this issue, different techniques can be used:
1. If the same file is loaded multiple times, for example in the same pipeline or after restarting the kernel, load the original data only once and save it as a pickle, Parquet, or Feather object.

In [65]:
import pandas as pd
import numpy as np

import datatable as dt 

In [58]:
dd = {}

for i in range(30):
    if i < 10:
        dd[f'Zeros{i}'] = np.zeros(10000000).astype(str)
    elif i < 20:
        dd[f'Zeros{i}'] = np.zeros(10000000).astype(int)
    else:
        dd[f'Zeros{i}'] = np.zeros(10000000).astype(float)

In [59]:
pd.DataFrame(dd).to_csv('Zeros.csv')

In [60]:
PATH = 'Zeros.csv'

data = pd.read_csv(PATH)

# To Pickle.
data.to_pickle(PATH[:-3] + 'pickle')

# To Parquet.
data.to_parquet(PATH[:-3] + 'parquet', engine = 'pyarrow')

# To Feather.
data.to_feather(PATH[:-3] + 'feather')
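
The "load once, then reuse a binary copy" idea can be sketched as a small helper. This is only an illustration: `load_cached` is a hypothetical name, not a Pandas function, and pickle is used here only because it needs no extra dependency; Parquet or Feather would follow the same pattern.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_cached(csv_path):
    """Read a CSV once, then reuse a pickle copy on later calls.

    Hypothetical helper for illustration; not part of pandas.
    """
    csv_path = Path(csv_path)
    cache_path = csv_path.with_suffix(".pickle")
    if cache_path.exists():
        # Fast path: a binary copy already exists, so skip CSV parsing.
        return pd.read_pickle(cache_path)
    # Slow path: parse the text file, then store a binary copy for next time.
    df = pd.read_csv(csv_path)
    df.to_pickle(cache_path)
    return df


# Small demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_csv = Path(tmp) / "demo.csv"
    pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]}).to_csv(demo_csv, index=False)

    first = load_cached(demo_csv)   # parses the CSV and writes demo.pickle
    cache_created = demo_csv.with_suffix(".pickle").exists()
    second = load_cached(demo_csv)  # reads demo.pickle instead of the CSV
```

Note that this sketch never invalidates the cache: if the CSV changes, the pickle file has to be deleted by hand.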

In [61]:
%timeit for _ in range(1000): True

_ = pd.read_csv(PATH)

26.5 µs ± 3.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [62]:
%timeit for _ in range(1000): True

# Note: as a line magic, %timeit times only the statement on its own line (the empty loop),
# not the read below, so the timings in these cells do not compare the file formats.
_ = pd.read_pickle(PATH[:-3] + 'pickle')

20.7 µs ± 2.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [63]:
%timeit for _ in range(1000): True

_ = pd.read_parquet(PATH[:-3] + 'parquet')

18 µs ± 542 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [64]:
%timeit for _ in range(1000): True

_ = pd.read_feather(PATH[:-3] + 'feather')

19.4 µs ± 2.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
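
To compare the formats, the read call itself has to be the timed statement. In a notebook this can be written as `%timeit pd.read_csv(PATH)` on a single line. The sketch below does the same with the standard `timeit` module so that it also runs outside IPython. The file names and the small table size are assumptions chosen to keep the example quick, and no timings are quoted because they depend on the machine.

```python
import tempfile
import timeit
from pathlib import Path

import numpy as np
import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "sample.csv"
    pickle_path = Path(tmp) / "sample.pickle"

    # A much smaller table than in the cells above, to keep the demo fast.
    sample = pd.DataFrame(
        np.zeros((100_000, 10)),
        columns=[f"Zeros{i}" for i in range(10)],
    )
    sample.to_csv(csv_path, index=False)
    sample.to_pickle(pickle_path)

    # timeit.timeit calls the function `number` times and returns total seconds.
    repeats = 3
    csv_seconds = timeit.timeit(lambda: pd.read_csv(csv_path), number=repeats) / repeats
    pickle_seconds = timeit.timeit(lambda: pd.read_pickle(pickle_path), number=repeats) / repeats

print(f"read_csv:    {csv_seconds:.4f} s per read")
print(f"read_pickle: {pickle_seconds:.4f} s per read")
```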


2. If the file is imported only once, use datatable.

In [66]:
dt_df = dt.fread(PATH)
pd_df = dt_df.to_pandas()
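
Since datatable is an optional dependency, a reader function can fall back to Pandas when it is not installed. The following is a minimal sketch; `read_csv_fast` is a hypothetical helper name, and column dtypes may differ slightly between the two readers.

```python
import tempfile
from pathlib import Path

import pandas as pd

try:
    import datatable as dt
except ImportError:
    # datatable is optional; fall back to pandas if it is missing.
    dt = None


def read_csv_fast(path):
    """Read a CSV with datatable when available, otherwise with pandas."""
    if dt is not None:
        # fread parses the file; to_pandas converts the Frame to a DataFrame.
        return dt.fread(str(path)).to_pandas()
    return pd.read_csv(path)


with tempfile.TemporaryDirectory() as tmp:
    demo_csv = Path(tmp) / "demo.csv"
    pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}).to_csv(demo_csv, index=False)
    demo_df = read_csv_fast(demo_csv)
```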

##### **Filter dataset on categorical variables**

In [68]:
import random

In [69]:
city_list = ["New York", "Manchester", "California", "Munich", "Bombay", 
             "Sydney", "London", "Moscow", "Dubai", "Tokyo"]

job_list = ["Software Development Engineer", "Research Engineer", 
            "Test Engineer", "Software Development Engineer-II", 
            "Python Developer", "Back End Developer", 
            "Front End Developer", "Data Scientist", 
            "IOS Developer", "Android Developer"]

cmp_list = ["Amazon", "Google", "Infosys", "Mastercard", "Microsoft", 
            "Uber", "IBM", "Apple", "Wipro", "Cognizant"]

data = []
for i in range(4_096_000):
  
    company = random.choice(cmp_list)
    job = random.choice(job_list)
    city = random.choice(city_list)
    salary = int(round(np.random.rand(), 3)*10**6)
    employment = random.choices(["Full Time", "Intern"], weights=(80, 20))[0]
    rating = round((np.random.rand()*5), 1)
    
    data.append([company, job, city, salary, employment, rating])
    
data = pd.DataFrame(data, columns=["Company Name", "Employee Job Title",
                                   "Employee Work Location",  "Employee Salary", 
                                   "Employment Status", "Employee Rating"])

In [74]:
data.head()

| | Company Name | Employee Job Title | Employee Work Location | Employee Salary | Employment Status | Employee Rating |
|---|---|---|---|---|---|---|
| 0 | Cognizant | Back End Developer | London | 526000 | Full Time | 3.7 |
| 1 | Infosys | Front End Developer | Dubai | 109000 | Full Time | 2.2 |
| 2 | IBM | Front End Developer | London | 288000 | Full Time | 2.9 |
| 3 | Wipro | Research Engineer | London | 214000 | Full Time | 4.7 |
| 4 | Uber | Test Engineer | Tokyo | 87000 | Intern | 1.3 |


In [80]:
%%timeit for i in range(1000): True 

data[data['Company Name'] == 'Amazon']

265 ms ± 5.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [84]:
%%timeit for i in range(1000): True

data.groupby('Company Name').get_group("Amazon")

404 ms ± 5.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
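
Both approaches return the same rows. In the timings above, the boolean mask is faster for a single lookup, because the `groupby` call is repeated inside the timed statement. A grouped object may pay off when it is built once and reused for several categories; that is an assumption to measure on your own data, not a result from this notebook. A minimal sketch on a tiny, made-up table:

```python
import pandas as pd

# Tiny illustrative table; the values are made up for the example.
df = pd.DataFrame({
    "Company Name": ["Amazon", "Google", "Amazon", "Uber", "Google"],
    "Employee Salary": [526000, 109000, 288000, 214000, 87000],
})

# Option 1: boolean mask, which scans the column on every lookup.
mask_result = df[df["Company Name"] == "Amazon"]

# Option 2: build the groups once, then look up each category by key.
grouped = df.groupby("Company Name")
group_result = grouped.get_group("Amazon")

# Reusing the same grouped object for every category.
per_company = {name: grouped.get_group(name) for name in grouped.groups}
```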


##### **Merging dataframes**

In [85]:
df1 = pd.DataFrame([["A", 1], ["B", 2]], columns = ["col_a", "col_b"])
df2 = pd.DataFrame([["A", 3], ["B", 4]], columns = ["col_a", "col_c"])

In [87]:
%%timeit for i in range(1000): True 

pd.merge(df1, df2, on = "col_a", how = "inner")

1.24 ms ± 64.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [89]:
df1.set_index("col_a", inplace=True)
df2.set_index("col_a", inplace=True)

In [90]:
%%timeit for i in range(1000): True 

df1.join(df2)

244 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
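
The two calls can be checked for equivalence on the same toy tables. In this sketch, `left` and `right` are illustrative names, and `how="inner"` is passed to `join` explicitly because its default is a left join. The `join` timing above does not include the `set_index` calls, so that cost should be counted when the index is not already in place.

```python
import pandas as pd

left = pd.DataFrame([["A", 1], ["B", 2]], columns=["col_a", "col_b"])
right = pd.DataFrame([["A", 3], ["B", 4]], columns=["col_a", "col_c"])

# Column-based merge on the shared key.
merged = pd.merge(left, right, on="col_a", how="inner")

# Index-based join: move the key into the index, join, then restore the column.
joined = (
    left.set_index("col_a")
    .join(right.set_index("col_a"), how="inner")
    .reset_index()
)
```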
