## Code execution optimization, vectorization, Numba

Materials:
* Makrushin S.V. Lecture 3: Code execution optimization, vectorization, Numba
* IPython Cookbook, Second Edition (2018), chapter 4
* https://numba.pydata.org/numba-doc/latest/user/5minguide.html

In [None]:
import random 
import numpy as np
import pandas as pd
import numba 
from numba import njit, prange

In [None]:
!pip install line_profiler



## Problems to work through together

1. Generate an array `A` of `N = 1 million` random integers in the range from 0 to 1000. Let `B[i] = A[i] + 100`. Compute the mean of array `B`.

In [None]:
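%%time
# %%time (cell magic) times the whole cell; a bare %time line times an empty statement
A = np.random.randint(0, 1001, 1000000)  # upper bound is exclusive, so 1001 includes 1000
B = A + 100
np.mean(B)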

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.48 µs


599.903608

In [None]:
%%time
A = np.random.uniform(0, 1000, 1000000)
B = A + 100
np.mean(B)

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.01 µs


600.0420371156964

In [None]:
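@njit
def exercise_1(n):
  # random.uniform and list comprehensions are supported in Numba's nopython mode
  A = [random.uniform(0, 1000) for _ in range(n)]
  # Build the array once instead of summing a list and converting it again for the length
  B = np.array([a + 100 for a in A])
  return B.sum() / B.shape[0]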

In [None]:
%%time
exercise_1(1000000)

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 8.11 µs


600.0363992350647
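The timings of a few microseconds recorded above are what a bare `%time` line reports when it has no statement to time. As a hedged, notebook-independent sketch, the same comparison can be made with the standard `timeit` module. The helper names `mean_loop` and `mean_vectorized` are illustrative and not part of the original notebook.

```python
import timeit

import numpy as np

N = 1_000_000
rng = np.random.default_rng(0)
A = rng.integers(0, 1001, N)  # integers in [0, 1000]; the upper bound is exclusive


def mean_loop(values):
    """Pure-Python loop: adds 100 to every element and averages the result."""
    total = 0
    for v in values:
        total += v + 100
    return total / len(values)


def mean_vectorized(values):
    """Vectorized version: one array addition and one reduction."""
    return (values + 100).mean()


# number=1 keeps the slow loop affordable; increase it for more stable numbers
t_loop = timeit.timeit(lambda: mean_loop(A), number=1)
t_vec = timeit.timeit(lambda: mean_vectorized(A), number=1)
print(f"loop: {t_loop:.4f} s, vectorized: {t_vec:.4f} s")
```

Both functions should return the same mean; only the elapsed time differs.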

2. Create a table with 2 million rows and 4 columns filled with random numbers. Add a `key` column whose values are drawn from the English alphabet. Select the subset of rows whose `key` value is one of the first 5 English letters.

In [None]:
def new_key(r):
  k = []
  for i in range(2000000):
    k.append(random.choice(r))
  return k

In [None]:
%%time
df = pd.DataFrame()
arr = list("abcdefghijklmnopqrstuvwxyz")
df["1"] = np.random.randint(0, 1000, 2000000)
df["2"] = np.random.randint(0, 1000, 2000000)
df["3"] = np.random.randint(0, 1000, 2000000)
df["4"] = np.random.randint(0, 1000, 2000000)
df["key"] = np.array(new_key(arr))
arr_5 = ["a", "b", "c", "d", "e"]
# Boolean mask: keep only the rows whose key is one of the first five letters
df.loc[df['key'].isin(arr_5)]

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.2 µs


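| | 1 | 2 | 3 | 4 | key |
|---|---|---|---|---|---|
| 0 | 935 | 61 | 618 | 650 | e |
| 5 | 700 | 109 | 469 | 889 | c |
| 7 | 499 | 304 | 416 | 792 | c |
| 9 | 877 | 361 | 415 | 654 | a |
| 12 | 41 | 752 | 249 | 172 | a |
| ... | ... | ... | ... | ... | ... |
| 1999980 | 42 | 66 | 710 | 35 | b |
| 1999985 | 121 | 219 | 944 | 328 | e |
| 1999986 | 846 | 905 | 523 | 235 | e |
| 1999994 | 522 | 984 | 236 | 57 | d |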
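The `key` column above is filled by a Python loop that calls `random.choice` two million times. A minimal vectorized sketch, assuming NumPy's `Generator.choice` is an acceptable replacement for that loop, could look as follows. The variable names are illustrative.

```python
import string

import numpy as np
import pandas as pd

N = 2_000_000
rng = np.random.default_rng(0)
letters = np.array(list(string.ascii_lowercase))

# Four random integer columns created in one call instead of four assignments
df = pd.DataFrame(rng.integers(0, 1000, size=(N, 4)), columns=["1", "2", "3", "4"])

# One vectorized draw replaces the per-row random.choice loop
df["key"] = rng.choice(letters, size=N)

# Rows whose key is one of the first five letters
subset = df[df["key"].isin(letters[:5])]
print(subset.shape)
```

The selection step is unchanged; only the generation of the data is moved into NumPy.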


## Lab 3

# 1. The files `recipes_sample.csv` and `reviews_sample.csv` (__Lab 2__) contain information about recipes and about reviews of those recipes, respectively. Load the data from the files as `pd.DataFrame` objects named `recipes` and `reviews`. Make sure the index column(s) are read correctly. Convert the columns to the appropriate types.

Implement several versions of a function that computes the mean of the `rating` column of the `reviews` table for reviews left in 2010.

A. Using the `DataFrame.iterrows` method on the original table;

B. Using the `DataFrame.iterrows` method on a table that contains only the 2010 reviews;

C. Using the `Series.mean` method.

Check that the results of all the functions are correct and identical. Measure the execution time of each function.


Which of the functions is the slowest? What affects the execution speed the most? Use the line_profiler profiler to answer. Save the profiler output in a separate text cell and comment on it.
(*). Can you speed up function 1B by dropping `iterrows`, but without using the `mean` method?



In [None]:
# Clone the entire repo.
!git clone -l -s https://github.com/BNik2001/Big-data-processing-technologies.git
%cd /content/Big-data-processing-technologies/Pandas/sem/data
!ls

Cloning into 'Big-data-processing-technologies'...
remote: Enumerating objects: 92, done.
remote: Counting objects: 100% (92/92), done.
remote: Compressing objects: 100% (74/74), done.
remote: Total 92 (delta 18), reused 57 (delta 8), pack-reused 0
Unpacking objects: 100% (92/92), done.
/content/Big-data-processing-technologies/Pandas/sem/data
recipes_sample.csv  reviews_sample.csv	sp500hst.txt  sp_data2.csv


In [None]:
recipes = pd.read_csv('recipes_sample.csv', sep=',')
recipes.head(10)

| | name | id | minutes | contributor_id | submitted | n_steps | description | n_ingredients |
|---|---|---|---|---|---|---|---|---|
| 0 | george s at the cove black bean soup | 44123 | 90 | 35193 | 2002-10-25 | | an original recipe created by chef scott meska... | 18.0 |
| 1 | healthy for them yogurt popsicles | 67664 | 10 | 91970 | 2003-07-26 | | my children and their friends ask for my homem... | |
| 2 | i can t believe it s spinach | 38798 | 30 | 1533 | 2002-08-29 | | these were so go, it surprised even me. | 8.0 |
| 3 | italian gut busters | 35173 | 45 | 22724 | 2002-07-27 | | my sister-in-law made these for us at a family... | |
| 4 | love is in the air beef fondue sauces | 84797 | 25 | 4470 | 2004-02-23 | 4.0 | i think a fondue is a very romantic casual din... | |
| 5 | mennonite corn fritters | 44045 | 15 | 41706 | 2002-10-25 | | ok - my heritage has been revealed. :) these a... | |
| 6 | open sesame noodles | 107229 | 28 | 173674 | 2004-12-30 | 8.0 | this is a very versatile and widely enjoyed pa... | 12.0 |
| 7 | say what banana sandwich | 95926 | 5 | 118163 | 2004-07-20 | 4.0 | you just have to try it to believe it. | |
| 8 | 1 in canada chocolate chip cookies | 453467 | 45 | 1848091 | 2011-04-11 | 12.0 | this is the recipe that we use at my school ca... | 11.0 |
| 9 | 412 broccoli casserole | 306168 | 40 | 50969 | 2008-05-30 | 6.0 | since there are already 411 recipes for brocco... | |


In [None]:
reviews = pd.read_csv('reviews_sample.csv', sep=',', index_col=0)
reviews.head(10)

| | user_id | recipe_id | date | rating | review |
|---|---|---|---|---|---|
| 370476 | 21752 | 57993 | 2003-05-01 | 5 | Last week whole sides of frozen salmon fillet ... |
| 624300 | 431813 | 142201 | 2007-09-16 | 5 | So simple and so tasty! I used a yellow capsi... |
| 187037 | 400708 | 252013 | 2008-01-10 | 4 | Very nice breakfast HH, easy to make and yummy... |
| 706134 | 2001852463 | 404716 | 2017-12-11 | 5 | These are a favorite for the holidays and so e... |
| 312179 | 95810 | 129396 | 2008-03-14 | 5 | Excellent soup! The tomato flavor is just gre... |
| 910362 | 35106 | 31322 | 2003-01-03 | 4 | I forgot to add skim milk but it still tasted ... |
| 212649 | 404333 | 199579 | 2006-12-10 | 5 | Made this for dinner it was so excellent, fina... |
| 815389 | 162888 | 16067 | 2005-12-09 | 5 | When I snapped the picture, I forgot to review... |
| 642377 | 89831 | 33715 | 2007-07-03 | 5 | This was good combination of flavors but I wil... |
| 1023302 | 308434 | 11252 | 2008-12-14 | 5 | Oh Bergy! These wonderful little cakes are aw... |


## DataFrame.iterrows

In [None]:
reviews['date'] = pd.to_datetime(reviews['date'])
reviews['year'] = reviews['date'].dt.year 
reviews_2010 = reviews[reviews['year'] == 2010]


def simple_mean_rating_in_year(n_year=2010):
    count_rating, count_reviews = 0, 0

    for ind, line in reviews.iterrows():
        if line['date'].year != n_year:
            continue
        count_rating += line['rating']
        count_reviews += 1

    return count_rating / count_reviews

a = simple_mean_rating_in_year()

In [None]:
reviews['year']

370476     2003
624300     2007
187037     2008
706134     2017
312179     2008
           ... 
1013457    2009
158736     2012
1059834    2008
453285     2015
691207     2010
Name: year, Length: 126696, dtype: int64

## DataFrame.iterrows (2010 reviews only)

In [None]:
reviews['date'] = pd.to_datetime(reviews['date'])
reviews['year'] = reviews['date'].dt.year 
reviews_2010 = reviews[reviews['year'] == 2010]

def smart_mean_rating_in_year(rev=reviews_2010):
    count_rating = sum(line['rating'] for ind, line in rev.iterrows())
    count_reviews = rev.shape[0]
    return count_rating / count_reviews

b = smart_mean_rating_in_year()

## Series.mean

In [None]:
reviews['date'] = pd.to_datetime(reviews['date'])
reviews['year'] = reviews['date'].dt.year 
reviews_2010 = reviews[reviews['year'] == 2010]


def smartest_mean_rating_in_year(n_year=2010):
    return reviews[reviews['date'].dt.year == n_year]['rating'].mean()

c = smartest_mean_rating_in_year()

## Verification

In [None]:
a == b == c

False

In [None]:
a, b, c

(4.4544402182900615, 4.4544402182900615, 4.4544402182900615)
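The source does not show why the exact comparison returned `False` while the printed values look identical (stale cell state is one possible cause, but that is an assumption). Independently of the cause, results produced by different summation orders are usually compared with a tolerance rather than with `==`. A minimal sketch using the standard library, with illustrative values rather than the notebook's data:

```python
import math

# The same three numbers summed in a different order
x = 0.1 + 0.2 + 0.3
y = 0.3 + 0.2 + 0.1

print(x == y)                            # exact comparison
print(math.isclose(x, y, rel_tol=1e-9))  # tolerance-based comparison


def all_close(*values, rel_tol=1e-9):
    """Return True if every value is close to the first one (illustrative helper)."""
    first = values[0]
    return all(math.isclose(first, v, rel_tol=rel_tol) for v in values[1:])


print(all_close(x, y, 0.6))
```

In the notebook the analogous check would be `all_close(a, b, c)`.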

In [None]:
%%timeit
simple_mean_rating_in_year()

1 loop, best of 5: 13.4 s per loop


In [None]:
%%timeit
smart_mean_rating_in_year()

1 loop, best of 5: 1.27 s per loop


In [None]:
%%timeit
smartest_mean_rating_in_year()

100 loops, best of 5: 18.4 ms per loop
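For the starred question in task 1 (speeding up variant B without `iterrows` and without `mean`), one possible sketch is to sum the ratings column directly and divide by the number of rows. The function name `mean_rating_no_iterrows` and the sample frame are illustrative. In the notebook the argument would be `reviews_2010`.

```python
import pandas as pd


def mean_rating_no_iterrows(rev):
    """Mean of the rating column via a single array reduction, without iterrows or mean."""
    ratings = rev["rating"].to_numpy()  # one contiguous array instead of per-row Series
    return ratings.sum() / len(ratings)


# Small stand-in for the filtered reviews table
sample = pd.DataFrame({"rating": [5, 4, 5, 3]})
result = mean_rating_no_iterrows(sample)
print(result)
```

Most of the cost of `iterrows` comes from building a `Series` object for every row, so avoiding that construction is where the expected gain lies.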


# 3. You are given a function that collects statistics on how many reviews contain a given word. Measure the execution time of this function. Can you find the bottlenecks in the code using a profiler? Describe (in words) what is implemented suboptimally in the existing code. Optimize the function and achieve a significant (at least one order of magnitude) speedup.

In [None]:
%load_ext line_profiler

In [None]:
def get_word_reviews_count(df):
    word_reviews = {}
    for _, row in df.dropna(subset=['review']).iterrows():
        recipe_id, review = row['recipe_id'], row['review']
        words = review.split(' ')
        for word in words:
            if word not in word_reviews:
                word_reviews[word] = []
            word_reviews[word].append(recipe_id)
    
    word_reviews_count = {}
    for _, row in df.dropna(subset=['review']).iterrows():
        review = row['review']
        words = review.split(' ')
        for word in words:
            word_reviews_count[word] = len(word_reviews[word])
    return word_reviews_count



In [None]:
get_word_reviews_count(reviews_2010)

{'This': 3641,
 'soup': 469,
 'is': 5115,
 'so': 4585,
 'comforting.': 8,
 '': 21331,
 'I': 27983,
 'used': 4537,
 'low-sodium,': 1,
 'low-fat': 14,
 'chicken': 986,
 'stock.': 25,
 'butter.': 76,
 'dried': 244,
 'parsley.': 23,
 'So': 459,
 'easy': 1540,
 'to': 12576,
 'do': 670,
 'in': 6071,
 'less': 309,
 'than': 1019,
 'an': 989,
 'hour,': 22,
 "that's": 175,
 'great': 1802,
 ':)': 603,
 'DH': 372,
 'really': 1936,
 'loved': 1334,
 'it.': 1111,
 'Me': 9,
 'too': 1006,
 'of': 10667,
 'course.': 7,
 'Thanks': 2808,
 'Breezermom': 2,
 'Made': 1959,
 'for': 13061,
 'Market': 24,
 'tag': 137,
 'game': 80,
 "We're": 14,
 'addicted': 8,
 'this': 8273,
 'jam': 58,
 '--': 185,
 "it's": 524,
 'delicious!': 363,
 'tend': 23,
 'be': 2362,
 'vinegar-phobic': 1,
 'cut': 716,
 'that': 3609,
 'half;': 4,
 'we': 1210,
 'also': 1443,
 'adore': 9,
 'black': 181,
 'pepper,': 97,
 'increase': 56,
 'it': 10691,
 'by': 668,
 'half.': 62,
 'Also,': 148,
 'once': 147,
 'the': 26432,
 'strawberries': 61,
 '

In [None]:
%lprun -f get_word_reviews_count get_word_reviews_count(reviews_2010)

In [None]:
%timeit get_word_reviews_count(reviews_2010)

1 loop, best of 5: 3.56 s per loop


In [None]:
def get_word_reviews_count2(df):
    word_reviews = {}
    word_reviews_count = {}
    for _, row in df.dropna(subset=['review']).iterrows():
        recipe_id, review = row['recipe_id'], row['review']
        words = review.split(' ')
        for word in words:
            if word not in word_reviews:
                word_reviews[word] = []
            word_reviews[word].append(recipe_id)
            word_reviews_count[word] = len(word_reviews[word])
    return word_reviews_count

In [None]:
get_word_reviews_count2(reviews_2010)

{'This': 3641,
 'soup': 469,
 'is': 5115,
 'so': 4585,
 'comforting.': 8,
 '': 21331,
 'I': 27983,
 'used': 4537,
 'low-sodium,': 1,
 'low-fat': 14,
 'chicken': 986,
 'stock.': 25,
 'butter.': 76,
 'dried': 244,
 'parsley.': 23,
 'So': 459,
 'easy': 1540,
 'to': 12576,
 'do': 670,
 'in': 6071,
 'less': 309,
 'than': 1019,
 'an': 989,
 'hour,': 22,
 "that's": 175,
 'great': 1802,
 ':)': 603,
 'DH': 372,
 'really': 1936,
 'loved': 1334,
 'it.': 1111,
 'Me': 9,
 'too': 1006,
 'of': 10667,
 'course.': 7,
 'Thanks': 2808,
 'Breezermom': 2,
 'Made': 1959,
 'for': 13061,
 'Market': 24,
 'tag': 137,
 'game': 80,
 "We're": 14,
 'addicted': 8,
 'this': 8273,
 'jam': 58,
 '--': 185,
 "it's": 524,
 'delicious!': 363,
 'tend': 23,
 'be': 2362,
 'vinegar-phobic': 1,
 'cut': 716,
 'that': 3609,
 'half;': 4,
 'we': 1210,
 'also': 1443,
 'adore': 9,
 'black': 181,
 'pepper,': 97,
 'increase': 56,
 'it': 10691,
 'by': 668,
 'half.': 62,
 'Also,': 148,
 'once': 147,
 'the': 26432,
 'strawberries': 61,
 '

In [None]:
%lprun -f get_word_reviews_count2 get_word_reviews_count2(reviews_2010)

In [None]:
%timeit get_word_reviews_count2(reviews_2010)

1 loop, best of 5: 2.08 s per loop
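The version above still iterates with `iterrows` and stores a list of recipe ids per word only to take its length, so the speedup is less than a factor of two. A hedged sketch of a lighter approach is to iterate over the review strings only and count tokens with `collections.Counter`. The function name `count_words` is illustrative. In the notebook it would be called as `count_words(reviews_2010['review'].dropna())`.

```python
from collections import Counter


def count_words(reviews_iter):
    """Count occurrences of each space-separated token across all review strings."""
    counts = Counter()
    for text in reviews_iter:
        # Same tokenization as the original function: split on single spaces
        counts.update(text.split(' '))
    return dict(counts)


sample_reviews = ["So easy to make", "easy and tasty", "So  good"]
word_counts = count_words(sample_reviews)
print(word_counts)
```

Like the original function, this counts every occurrence of a token, including the empty string produced by consecutive spaces. Whether it reaches the required order-of-magnitude speedup on the real data has to be measured with `%timeit`.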


4. Write several versions of a `MAPE` function (see [MAPE](https://en.wikipedia.org/wiki/Mean_absolute_percentage_error)) that computes the mean absolute percentage error of a recipe review's rating relative to the mean rating over all reviews of that recipe.
    1. Without vectorized operations and `numpy` array methods, and without `numba`
    2. Without vectorized operations and `numpy` array methods, but with `numba`
    3. With vectorized operations and `numpy` array methods, but without `numba`
    4. With vectorized operations and `numpy` array methods, and with `numba`
    
Measure the execution time of each implementation.

Note: remove reviews with a zero rating from the sample.
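
For reference, the linked article defines MAPE as follows, where $A_t$ is the actual value and $F_t$ is the forecast. Mapping $A_t$ to the rating of a single review and $F_t$ to the mean rating of the recipe is an assumption based on the task wording.

$$
\mathrm{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right|
$$

Because $A_t$ appears in the denominator, zero ratings have to be excluded. Note that the code cells in this notebook divide by the predicted value rather than by the actual one, which is a variation on this definition.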


In [None]:
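rated = reviews[reviews['rating'] > 0]  # drop zero-rating reviews, as the task requires
# numeric_only=True skips the text and date columns, which cannot be averaged
reviews_A = rated.groupby(by='recipe_id').mean(numeric_only=True).reset_index()
reviews_Fs = rated.groupby(by='recipe_id')['rating'].agg(list)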
reviews_A['Fs'] = reviews_A['recipe_id'].map(reviews_Fs)
result = reviews_A.drop(['user_id'], axis = 1)
means = result['rating'].to_list()
fs = result['Fs'].to_list()
result2 = result[:10]
result2

| | recipe_id | rating | year | Fs |
|---|---|---|---|---|
| 0 | 48 | 2.0 | 2004.0 | [2] |
| 1 | 55 | 4.75 | 2007.75 | [5, 5, 4, 5] |
| 2 | 66 | 4.944444 | 2008.888889 | [5, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ... |
| 3 | 91 | 4.75 | 2007.5 | [5, 5, 4, 5] |
| 4 | 94 | 5.0 | 2008.0 | [5, 5, 5, 5] |
| 5 | 128 | 5.0 | 2009.5 | [5, 5] |
| 6 | 153 | 4.935484 | 2006.612903 | [5, 5, 5, 4, 5, 5, 5, 5, 5, 5, 4, 5, 5, 5, 5, ... |
| 7 | 176 | 5.0 | 2007.0 | [5] |
| 8 | 181 | 2.666667 | 2004.0 | [4, 1, 3] |
| 9 | 186 | 4.777778 | 2009.444444 | [5, 4, 5, 5, 5, 4, 5, 5, 5] |



In [None]:
def recipe_id_rating(df,id):
  my_list= []
  id_stroka = df[df["recipe_id"] == id]
  actual = list(id_stroka['Fs'])[0]
  len_actual = len(actual)
  my_list.append((list(id_stroka['rating']))[0])
  predicted = my_list*len_actual
  return actual, predicted

In [None]:
from numba import jit
import numpy as np

    1. Without vectorized operations and `numpy` array methods, and without `numba`

In [None]:
def mape1(actual, predicted):
  minus_v = []
  for i in range(len(predicted)):
    minus_v.append(actual[i]-predicted[i])
  dvsn_v = []
  for i in range(len(predicted)):
    dvsn_v.append(minus_v[i]/predicted[i])
  abs_v = []
  for i in range(len(dvsn_v)):
    abs_v.append(abs(dvsn_v[i]))  # abs(), because x**2**0.5 parses as x**(2**0.5)
  sum_v = 0
  len_v = 0
  for i in range(len(abs_v)):
    sum_v += abs_v[i]
    len_v += 1
  mean_v = sum_v/len_v
  prcnt_v = mean_v * 100
  return prcnt_v

    2. Without vectorized operations and `numpy` array methods, but with `numba`

In [None]:
from numba import guvectorize

# A gufunc returns its scalar result through a one-element output array,
# so the last argument is declared as float64[:] and written via prcnt_v[0].
@guvectorize(['void(float64[:], float64[:], float64[:])'], '(n),(n)->()')
def mape2j(actual, predicted, prcnt_v):
  # Plain loop without NumPy array operations; Numba compiles it to machine code
  sum_v = 0.0
  len_v = 0
  for i in range(len(predicted)):
    sum_v += abs((actual[i] - predicted[i]) / predicted[i])
    len_v += 1
  prcnt_v[0] = sum_v / len_v * 100



    3. With vectorized operations and `numpy` array methods, but without `numba`

In [None]:
def mape3(actual, predicted):
  minus_v = np.array(actual) - predicted
  dvsn_v = minus_v / np.array(predicted)
  abs_v = np.abs(dvsn_v)
  mean_v = np.mean(abs_v)
  prcnt_v = mean_v * 100
  return prcnt_v
mape3 = np.vectorize(mape3,signature='(n),(n)->()')

    4. With vectorized operations and `numpy` array methods, and with `numba`

In [None]:
@guvectorize(['void(float64[:], float64[:], float64)'],'(n),(n)->()')
def mape4j(actual, predicted,prcnt_v):
  actual = np.array(actual)
  predicted = np.array(predicted)
  minus_v = actual - predicted
  dvsn_v = np.array(minus_v) / np.array(predicted)
  abs_v = np.abs(dvsn_v)
  mean_v = np.mean(abs_v)
  prcnt_v = mean_v * 100

Compilation is falling back to object mode WITHOUT looplifting enabled because Function "mape4j" failed type inference due to: No implementation of function Function(<built-in function array>) found for signature:
 
 >>> array(array(float64, 1d, A))
 
There are 2 candidate implementations:
   - Of which 2 did not match due to:
   Overload in function 'array': File: numba/core/typing/npydecl.py: Line 482.
     With argument(s): '(array(float64, 1d, A))':
    Rejected as the implementation raised a specific error:
      TypingError: array(float64, 1d, A) not allowed in a homogeneous sequence
  raised from /usr/local/lib/python3.7/dist-packages/numba/core/typing/npydecl.py:449

During: resolving callee type: Function(<built-in function array>)
During: typing of call at <ipython-input-7-a81ae5e4d31f> (3)


File "<ipython-input-7-a81ae5e4d31f>", line 3:
def mape4j(actual, predicted,prcnt_v):
  actual = np.array(actual)
  ^

  @guvectorize(['void(float64[:], float64[:], float64)'],'(n),(n)->

In [None]:
%%prun
mape1(recipe_id_rating(result2,66)[0],recipe_id_rating(result2,66)[1])

 

In [None]:
%%prun
mape2j(recipe_id_rating(result2,55)[0],recipe_id_rating(result2,55)[1])

 

In [None]:
%%prun
mape3(recipe_id_rating(result2,55)[0],recipe_id_rating(result2,55)[1])

 

In [None]:
%%prun
mape4j(recipe_id_rating(result2,55)[0],recipe_id_rating(result2,55)[1])

 

# Verification

In [None]:
%%prun
mape1(np.random.rand(10000000),np.random.rand(10000000))



 

In [None]:
%%prun
mape2j(np.random.rand(10000000),np.random.rand(10000000))

 

In [None]:
%%prun
mape3(np.random.rand(10000000),np.random.rand(10000000))

 

In [None]:
%%prun
mape4j(np.random.rand(10000000),np.random.rand(10000000))

 