# Pandas

Materials:
* Makrushin S.V., "Lecture 2: The Pandas Library"
* https://pandas.pydata.org/docs/user_guide/index.html#
* https://pandas.pydata.org/docs/reference/index.html
* Wes McKinney, "Python for Data Analysis"

## Tasks to work through together

In [1]:
import pandas as pd
import numpy as np

1. Load the data from the file `sp500hst.txt` and name the columns according to their contents: `"date", "ticker", "open", "high", "low", "close", "volume"`.

In [2]:
txt_data = pd.read_csv(
    'sp500hst.txt',
    names=['date', 'ticker', 'open', 'high', 'low', 'close', 'volume']
)
txt_data.date = pd.to_datetime(txt_data['date'], format='%Y%m%d')
txt_data

Unnamed: 0,date,ticker,open,high,low,close,volume
0,2009-08-21,A,25.60,25.6100,25.220,25.55,34758
1,2009-08-24,A,25.64,25.7400,25.330,25.50,22247
2,2009-08-25,A,25.50,25.7000,25.225,25.34,30891
3,2009-08-26,A,25.32,25.6425,25.145,25.48,33334
4,2009-08-27,A,25.50,25.5700,25.230,25.54,70176
...,...,...,...,...,...,...,...
122569,2010-08-13,ZMH,51.72,51.9000,51.380,51.44,14561
122570,2010-08-16,ZMH,51.13,51.4700,50.600,51.00,13489
122571,2010-08-17,ZMH,51.14,51.6000,50.890,51.21,20498
122572,2010-08-19,ZMH,51.63,51.6300,50.170,50.22,18259


2. Compute the mean of each of columns 3-6 (`open`, `high`, `low`, `close`).

In [3]:
txt_data.iloc[:,2:6].mean()

open     42.595458
high     43.102243
low      42.054464
close    42.601865
dtype: float64

3. Add a column containing only the number of the month that each date falls in.

In [4]:
txt_data['month'] = pd.DatetimeIndex(txt_data.date).month
txt_data

Unnamed: 0,date,ticker,open,high,low,close,volume,month
0,2009-08-21,A,25.60,25.6100,25.220,25.55,34758,8
1,2009-08-24,A,25.64,25.7400,25.330,25.50,22247,8
2,2009-08-25,A,25.50,25.7000,25.225,25.34,30891,8
3,2009-08-26,A,25.32,25.6425,25.145,25.48,33334,8
4,2009-08-27,A,25.50,25.5700,25.230,25.54,70176,8
...,...,...,...,...,...,...,...,...
122569,2010-08-13,ZMH,51.72,51.9000,51.380,51.44,14561,8
122570,2010-08-16,ZMH,51.13,51.4700,50.600,51.00,13489,8
122571,2010-08-17,ZMH,51.14,51.6000,50.890,51.21,20498,8
122572,2010-08-19,ZMH,51.63,51.6300,50.170,50.22,18259,8


4. Compute the total trading volume for each ticker.

In [5]:
grouped = txt_data.groupby('ticker')
# Select the column before aggregating so that only `volume` is summed
grouped[['volume']].sum()

Unnamed: 0_level_0,volume
ticker,Unnamed: 1_level_1
A,8609336
AA,81898998
AAPL,52261170
ABC,9006756
ABT,18975870
...,...
XTO,21297931
YHOO,56837171
YUM,10971538
ZION,15551119


5. Load the data from the file `sp500hst.txt` and name the columns according to their contents: `"date", "ticker", "open", "high", "low", "close", "volume"`. Add a column with the full name behind each ticker, using the data from the file `sp_data2.csv`. Tickers whose names are missing must be handled gracefully.

In [6]:
names = pd.read_csv(
    'sp_data2.csv',
    delimiter=';',
    names=['short', 'fullname', 'percentage']
)
names

Unnamed: 0,short,fullname,percentage
0,AAPL,Apple,3.6%
1,AMZN,Amazon.com,3.2%
2,GOOGL,Alphabet,3.1%
3,GOOG,Alphabet,3.1%
4,MSFT,Microsoft,3.0%
...,...,...,...
500,SCG,SCANA,0.0%
501,AIZ,Assurant,0.0%
502,AYI,Acuity Brands,0.0%
503,HRB,H&R Block,0.0%


In [7]:
txt_data.merge(
    names.iloc[:,:2],
    how='left',
    left_on='ticker',
    right_on='short'
).drop(columns='short')

Unnamed: 0,date,ticker,open,high,low,close,volume,month,fullname
0,2009-08-21,A,25.60,25.6100,25.220,25.55,34758,8,Agilent Technologies
1,2009-08-24,A,25.64,25.7400,25.330,25.50,22247,8,Agilent Technologies
2,2009-08-25,A,25.50,25.7000,25.225,25.34,30891,8,Agilent Technologies
3,2009-08-26,A,25.32,25.6425,25.145,25.48,33334,8,Agilent Technologies
4,2009-08-27,A,25.50,25.5700,25.230,25.54,70176,8,Agilent Technologies
...,...,...,...,...,...,...,...,...,...
122569,2010-08-13,ZMH,51.72,51.9000,51.380,51.44,14561,8,
122570,2010-08-16,ZMH,51.13,51.4700,50.600,51.00,13489,8,
122571,2010-08-17,ZMH,51.14,51.6000,50.890,51.21,20498,8,
122572,2010-08-19,ZMH,51.63,51.6300,50.170,50.22,18259,8,
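
The merged table above leaves `fullname` empty for tickers that are absent from `sp_data2.csv` (for example `ZMH`). One possible way to handle such rows is to substitute an explicit placeholder. The sketch below uses tiny made-up frames (`quotes_demo`, `names_demo`) and the placeholder string `'unknown'`, all of which are assumptions for illustration and not part of the original solution.

```python
import pandas as pd

# Hypothetical miniature versions of the quotes and ticker-name tables
quotes_demo = pd.DataFrame({"ticker": ["A", "AAPL", "ZMH"]})
names_demo = pd.DataFrame({
    "short": ["A", "AAPL"],
    "fullname": ["Agilent Technologies", "Apple"],
})

# A left join keeps every quote row, even when no name is found
merged_demo = quotes_demo.merge(
    names_demo, how="left", left_on="ticker", right_on="short"
).drop(columns="short")

# Replace missing names with an explicit placeholder (an assumed convention)
merged_demo["fullname"] = merged_demo["fullname"].fillna("unknown")
print(merged_demo)
```

Whether a placeholder, the ticker itself, or a plain `NaN` is the right choice depends on how the column is used downstream.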


## Lab assignment 2

### Basic operations with `DataFrame`

1.1 The files `recipes_sample.csv` and `reviews_sample.csv` contain information about recipes and about reviews of those recipes, respectively. Load the data from the files as `pd.DataFrame` objects named `recipes` and `reviews`. Make sure the index column of the `reviews` table (the unnamed column) is read correctly.

In [8]:
recipes = pd.read_csv(
    'recipes_sample.csv'
)
recipes.submitted = pd.to_datetime(recipes.submitted, format='%Y-%m-%d')
recipes

Unnamed: 0,name,id,minutes,contributor_id,submitted,n_steps,description,n_ingredients
0,george s at the cove black bean soup,44123,90,35193,2002-10-25,,an original recipe created by chef scott meska...,18.0
1,healthy for them yogurt popsicles,67664,10,91970,2003-07-26,,my children and their friends ask for my homem...,
2,i can t believe it s spinach,38798,30,1533,2002-08-29,,"these were so go, it surprised even me.",8.0
3,italian gut busters,35173,45,22724,2002-07-27,,my sister-in-law made these for us at a family...,
4,love is in the air beef fondue sauces,84797,25,4470,2004-02-23,4.0,i think a fondue is a very romantic casual din...,
...,...,...,...,...,...,...,...,...
29995,zurie s holey rustic olive and cheddar bread,267661,80,200862,2007-11-25,16.0,this is based on a french recipe but i changed...,10.0
29996,zwetschgenkuchen bavarian plum cake,386977,240,177443,2009-08-24,,"this is a traditional fresh plum cake, thought...",11.0
29997,zwiebelkuchen southwest german onion cake,103312,75,161745,2004-11-03,,this is a traditional late summer early fall s...,
29998,zydeco soup,486161,60,227978,2012-08-29,,this is a delicious soup that i originally fou...,


In [9]:
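# Note: read this way, the file's unnamed index column becomes a regular
# column called "Unnamed: 0"; passing index_col=0 would load it as the index.
reviews = pd.read_csv(
    'reviews_sample.csv'
)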
reviews

Unnamed: 0.1,Unnamed: 0,user_id,recipe_id,date,rating,review
0,370476,21752,57993,2003-05-01,5,Last week whole sides of frozen salmon fillet ...
1,624300,431813,142201,2007-09-16,5,So simple and so tasty! I used a yellow capsi...
2,187037,400708,252013,2008-01-10,4,"Very nice breakfast HH, easy to make and yummy..."
3,706134,2001852463,404716,2017-12-11,5,These are a favorite for the holidays and so e...
4,312179,95810,129396,2008-03-14,5,Excellent soup! The tomato flavor is just gre...
...,...,...,...,...,...,...
126691,1013457,1270706,335534,2009-05-17,4,This recipe was great! I made it last night. I...
126692,158736,2282344,8701,2012-06-03,0,This recipe is outstanding. I followed the rec...
126693,1059834,689540,222001,2008-04-08,5,"Well, we were not a crowd but it was a fabulou..."
126694,453285,2000242659,354979,2015-06-02,5,I have been a steak eater and dedicated BBQ gr...


1.2 For each table, print its basic parameters:
* number of data points (rows);
* number of columns;
* data type of each column.
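
A minimal sketch of how these parameters could be obtained, shown on a small made-up frame (`demo` is a hypothetical stand-in for `recipes` or `reviews`):

```python
import pandas as pd

# Hypothetical miniature stand-in for the `recipes` table
demo = pd.DataFrame({
    "name": ["zydeco soup", "italian gut busters", "basbousa"],
    "minutes": [60, 45, 60],
    "n_steps": [None, None, 7.0],
})

n_rows, n_cols = demo.shape   # (number of rows, number of columns)
column_types = demo.dtypes    # Series mapping column name -> dtype

print(f"rows: {n_rows}, columns: {n_cols}")
print(column_types)
```

`DataFrame.info()` prints the same facts in a single report, together with non-null counts.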

1.3 Find out which columns of the tables contain missing values. Compute the share of rows that contain missing values relative to the total number of rows.
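
One possible approach, sketched on a made-up frame (`demo` and its values are assumptions): `isna()` marks the missing cells, after which a per-column sum and a per-row `any` give both answers.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in two columns
demo = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "n_steps": [np.nan, 4.0, np.nan, 16.0],
    "n_ingredients": [18.0, np.nan, 8.0, 10.0],
})

# Number of missing values in every column
na_per_column = demo.isna().sum()
columns_with_na = na_per_column[na_per_column > 0].index.tolist()

# A row counts as incomplete if at least one of its cells is missing;
# the mean of the boolean mask is the share of such rows
share_rows_with_na = demo.isna().any(axis=1).mean()

print(columns_with_na)
print(share_rows_with_na)
```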

1.4 Compute the mean of each numeric column (where it makes sense).

1.5 Create a series of 10 random recipe names.
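
A sketch of one way to do this with `Series.sample`. The `names` series is made up, and `random_state=42` is an arbitrary choice that only makes the draw reproducible.

```python
import pandas as pd

# Hypothetical column of recipe names
names = pd.Series([f"recipe {i}" for i in range(100)], name="name")

# Draw 10 distinct names at random (sampling without replacement by default)
random_names = names.sample(n=10, random_state=42)
print(random_names)
```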

1.6 Change the index of the `reviews` table so that rows are numbered starting from zero.
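
A minimal sketch, assuming the table currently carries a non-sequential index (the frame `demo_reviews` is a made-up example): `reset_index(drop=True)` replaces the index with 0, 1, 2, ... and discards the old one.

```python
import pandas as pd

# Hypothetical reviews table whose index is not sequential
demo_reviews = pd.DataFrame(
    {"user_id": [21752, 431813, 400708], "rating": [5, 5, 4]},
    index=[370476, 624300, 187037],
)

# drop=True discards the old index instead of turning it into a column
demo_reviews = demo_reviews.reset_index(drop=True)
print(demo_reviews)
```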

1.7 Show the recipes that take no more than 20 minutes to prepare and use no more than 5 ingredients.
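
This kind of selection can be sketched with a boolean mask. The frame below is a made-up example; note that a missing `n_ingredients` value never satisfies the comparison, so such rows are filtered out.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of recipe data
demo = pd.DataFrame({
    "name": ["soup", "popsicles", "salad", "fondue"],
    "minutes": [90, 10, 20, 25],
    "n_ingredients": [18.0, 4.0, 5.0, np.nan],
})

# Combine both conditions with &; each one needs its own parentheses
quick_and_simple = demo[(demo["minutes"] <= 20) & (demo["n_ingredients"] <= 5)]
print(quick_and_simple)
```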

### Working with dates in `pandas`

2.1 Convert the `submitted` column of the `recipes` table to a datetime type. Modify the solution to task 1.1 so that the column is read in the required format right away.
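
One way to read the column as dates straight away is the `parse_dates` argument of `pd.read_csv`. The sketch below reads from an in-memory string instead of `recipes_sample.csv`, so the CSV content is a made-up example.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for the real file
csv_text = (
    "name,submitted\n"
    "zydeco soup,2012-08-29\n"
    "italian gut busters,2002-07-27\n"
)

# parse_dates converts the listed columns to datetime64 while reading
demo = pd.read_csv(io.StringIO(csv_text), parse_dates=["submitted"])
print(demo.dtypes)
```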

2.2 Show the recipes that were added to the dataset no later than 2010.
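
A sketch of the filter using the `.dt` accessor, on a made-up frame; it assumes "no later than 2010" includes all of 2010.

```python
import pandas as pd

# Hypothetical recipes with already parsed submission dates
demo = pd.DataFrame({
    "name": ["black bean soup", "plum cake", "onion cake"],
    "submitted": pd.to_datetime(["2002-10-25", "2010-12-31", "2011-01-01"]),
})

# Keep rows whose submission year is 2010 or earlier
not_later_than_2010 = demo[demo["submitted"].dt.year <= 2010]
print(not_later_than_2010)
```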

### Working with string data in `pandas`

3.1 Add a `description_length` column to the `recipes` table that stores the length of the recipe description from the `description` column.

In [52]:
recipes['description_length'] = recipes.description.str.len()
recipes

Unnamed: 0,name,id,minutes,contributor_id,submitted,n_steps,description,n_ingredients,description_length
0,george s at the cove black bean soup,44123,90,35193,2002-10-25,,an original recipe created by chef scott meska...,18.0,330.0
1,healthy for them yogurt popsicles,67664,10,91970,2003-07-26,,my children and their friends ask for my homem...,,255.0
2,i can t believe it s spinach,38798,30,1533,2002-08-29,,"these were so go, it surprised even me.",8.0,39.0
3,italian gut busters,35173,45,22724,2002-07-27,,my sister-in-law made these for us at a family...,,154.0
4,love is in the air beef fondue sauces,84797,25,4470,2004-02-23,4.0,i think a fondue is a very romantic casual din...,,587.0
...,...,...,...,...,...,...,...,...,...
29995,zurie s holey rustic olive and cheddar bread,267661,80,200862,2007-11-25,16.0,this is based on a french recipe but i changed...,10.0,484.0
29996,zwetschgenkuchen bavarian plum cake,386977,240,177443,2009-08-24,,"this is a traditional fresh plum cake, thought...",11.0,286.0
29997,zwiebelkuchen southwest german onion cake,103312,75,161745,2004-11-03,,this is a traditional late summer early fall s...,,311.0
29998,zydeco soup,486161,60,227978,2012-08-29,,this is a delicious soup that i originally fou...,,648.0


3.2 Change the name of every recipe in the `recipes` table so that each word in the name starts with a capital letter.

In [61]:
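# Note: str.capitalize() upper-cases only the first character of each string
# (and lower-cases the rest); str.title() upper-cases the first letter of every word.
recipes.name = recipes.name.str.capitalize()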
recipes

Unnamed: 0,name,id,minutes,contributor_id,submitted,n_steps,description,n_ingredients,description_length
0,George s at the cove black bean soup,44123,90,35193,2002-10-25,,an original recipe created by chef scott meska...,18.0,330.0
1,Healthy for them yogurt popsicles,67664,10,91970,2003-07-26,,my children and their friends ask for my homem...,,255.0
2,I can t believe it s spinach,38798,30,1533,2002-08-29,,"these were so go, it surprised even me.",8.0,39.0
3,Italian gut busters,35173,45,22724,2002-07-27,,my sister-in-law made these for us at a family...,,154.0
4,Love is in the air beef fondue sauces,84797,25,4470,2004-02-23,4.0,i think a fondue is a very romantic casual din...,,587.0
...,...,...,...,...,...,...,...,...,...
29995,Zurie s holey rustic olive and cheddar bread,267661,80,200862,2007-11-25,16.0,this is based on a french recipe but i changed...,10.0,484.0
29996,Zwetschgenkuchen bavarian plum cake,386977,240,177443,2009-08-24,,"this is a traditional fresh plum cake, thought...",11.0,286.0
29997,Zwiebelkuchen southwest german onion cake,103312,75,161745,2004-11-03,,this is a traditional late summer early fall s...,,311.0
29998,Zydeco soup,486161,60,227978,2012-08-29,,this is a delicious soup that i originally fou...,,648.0
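
The output above has only the first letter of each name in upper case. The short comparison below, on two names taken from the table, shows how `str.capitalize()` and `str.title()` differ; the variable names are illustrative.

```python
import pandas as pd

names = pd.Series(["zydeco soup", "italian gut busters"])

# capitalize(): only the first character of the whole string
first_letter_only = names.str.capitalize().tolist()
# title(): the first character of every word
every_word = names.str.title().tolist()

print(first_letter_only)  # ['Zydeco soup', 'Italian gut busters']
print(every_word)         # ['Zydeco Soup', 'Italian Gut Busters']
```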


3.3 Add a `name_word_count` column to the `recipes` table that stores the number of words in the recipe name (assume that words in a name are separated only by spaces). Note that several consecutive spaces may occur between words.

In [71]:
recipes['name_word_count'] = recipes.name.apply(lambda x: len(x.split()))
recipes

Unnamed: 0,name,id,minutes,contributor_id,submitted,n_steps,description,n_ingredients,description_length,name_word_count
0,George s at the cove black bean soup,44123,90,35193,2002-10-25,,an original recipe created by chef scott meska...,18.0,330.0,8
1,Healthy for them yogurt popsicles,67664,10,91970,2003-07-26,,my children and their friends ask for my homem...,,255.0,5
2,I can t believe it s spinach,38798,30,1533,2002-08-29,,"these were so go, it surprised even me.",8.0,39.0,7
3,Italian gut busters,35173,45,22724,2002-07-27,,my sister-in-law made these for us at a family...,,154.0,3
4,Love is in the air beef fondue sauces,84797,25,4470,2004-02-23,4.0,i think a fondue is a very romantic casual din...,,587.0,8
...,...,...,...,...,...,...,...,...,...,...
29995,Zurie s holey rustic olive and cheddar bread,267661,80,200862,2007-11-25,16.0,this is based on a french recipe but i changed...,10.0,484.0,8
29996,Zwetschgenkuchen bavarian plum cake,386977,240,177443,2009-08-24,,"this is a traditional fresh plum cake, thought...",11.0,286.0,4
29997,Zwiebelkuchen southwest german onion cake,103312,75,161745,2004-11-03,,this is a traditional late summer early fall s...,,311.0,5
29998,Zydeco soup,486161,60,227978,2012-08-29,,this is a delicious soup that i originally fou...,,648.0,2
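
An equivalent vectorized formulation, given here as an alternative sketch and not as the original solution: `str.split()` without arguments splits on runs of whitespace, so repeated spaces do not produce empty "words". The sample names are made up.

```python
import pandas as pd

# Hypothetical names, one of them with repeated spaces
names = pd.Series(["zydeco soup", "top  secret   recipes", "basbousa"])

# Split on any run of whitespace, then take the length of each resulting list
word_count = names.str.split().str.len()
print(word_count.tolist())  # [2, 3, 1]
```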


### Grouping `pd.DataFrame` tables

4.1 Count the number of recipes submitted by each contributor (`contributor_id`). Which contributor added the most recipes?

In [10]:
grouped = recipes.groupby('contributor_id')
count_contr = grouped['name'].count()
print(count_contr)
print('\n\n\nMax_contr_id', count_contr.idxmax())


contributor_id
1530            5
1533          186
1534           50
1535           40
1538            8
             ... 
2001968497      2
2002059754      1
2002234079      1
2002234259      1
2002247884      1
Name: name, Length: 8404, dtype: int64



Max_contr_id 89831


4.2 Compute the average rating of each recipe. For how many recipes are there no reviews? Note that a review with a zero rating or with an empty text description does not count as missing.

In [11]:
left_join = recipes.merge(
    reviews,
    how='left',
    left_on='id',
    right_on='recipe_id'
).drop(columns='recipe_id')
left_join

Unnamed: 0.1,name,id,minutes,contributor_id,submitted,n_steps,description,n_ingredients,Unnamed: 0,user_id,date,rating,review
0,george s at the cove black bean soup,44123,90,35193,2002-10-25,,an original recipe created by chef scott meska...,18.0,920960.0,743566.0,2008-01-28,5.0,I lived in San Diego for 19 years and would g...
1,george s at the cove black bean soup,44123,90,35193,2002-10-25,,an original recipe created by chef scott meska...,18.0,920958.0,76503.0,2003-06-03,5.0,This soup is the Bomb! Don't hesitate to try.....
2,george s at the cove black bean soup,44123,90,35193,2002-10-25,,an original recipe created by chef scott meska...,18.0,920957.0,34206.0,2003-03-23,5.0,I just can't say enough about how wonderful th...
3,healthy for them yogurt popsicles,67664,10,91970,2003-07-26,,my children and their friends ask for my homem...,,897580.0,494084.0,2012-09-26,5.0,These are great! I use 100% (organic) juice a...
4,healthy for them yogurt popsicles,67664,10,91970,2003-07-26,,my children and their friends ask for my homem...,,897579.0,303445.0,2012-03-31,5.0,"Very, very good. My son loves these. He like..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
128591,zydeco soup,486161,60,227978,2012-08-29,,this is a delicious soup that i originally fou...,,493812.0,305531.0,2013-07-18,5.0,Delish! I made this as directed but used a smo...
128592,zydeco soup,486161,60,227978,2012-08-29,,this is a delicious soup that i originally fou...,,493330.0,1271506.0,2012-09-10,5.0,Now the only substitution I made was African B...
128593,zydeco soup,486161,60,227978,2012-08-29,,this is a delicious soup that i originally fou...,,493813.0,724631.0,2014-01-07,5.0,"Very tasty soup, moderate spiciness (even afte..."
128594,zydeco soup,486161,60,227978,2012-08-29,,this is a delicious soup that i originally fou...,,493811.0,133174.0,2013-07-18,5.0,Very yummy indeed. A spicy sausage was used i...


In [12]:
grouped_rec=left_join.groupby('id')
grouped_rec['rating'].mean()

id
48        1.000000
55        4.750000
66        4.944444
91        4.750000
94        5.000000
            ...   
536547    5.000000
536610    0.000000
536728    4.000000
536729    4.750000
536747    0.000000
Name: rating, Length: 30000, dtype: float64

Number of recipes that have no reviews:

In [13]:
no_rewiev = grouped_rec.count()[grouped_rec['rating'].count()==0].index
no_rewiev

Int64Index([  1144,   2691,   2759,   2994,   3145,   3162,   3363,   3492,
              4000,   4740,
            ...
            523406, 524235, 524501, 529128, 530211, 531001, 531398, 533052,
            533867, 534760],
           dtype='int64', name='id', length=1900)

In [14]:
len(no_rewiev)

1900
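
The same count can also be obtained without a join, by checking which recipe ids never appear in the reviews table. This is an alternative sketch on made-up miniature frames (`recipes_demo`, `reviews_demo`); the zero rating is included to show that such a review still counts as present.

```python
import pandas as pd

# Hypothetical miniature tables; recipe 1144 has no reviews
recipes_demo = pd.DataFrame({"id": [44123, 67664, 1144]})
reviews_demo = pd.DataFrame({
    "recipe_id": [44123, 44123, 67664],
    "rating": [5, 5, 0],  # a zero rating is still a review
})

# True for recipes that have at least one review row
has_review = recipes_demo["id"].isin(reviews_demo["recipe_id"])
n_without_reviews = int((~has_review).sum())
print(n_without_reviews)  # 1
```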

4.3 Count the number of recipes broken down by year of creation.

In [15]:
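left_join['year'] = pd.DatetimeIndex(left_join.submitted).year
grouped_year = left_join.groupby('year')
# Note: left_join holds one row per (recipe, review) pair, so count() counts
# those rows; grouped_year['id'].nunique() would count distinct recipes.
grouped_year['id'].count()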

year
1999     1436
2000      512
2001     5049
2002    21571
2003    16149
2004    13805
2005    16938
2006    12123
2007    14203
2008    11113
2009     7291
2010     3375
2011     1845
2012     1328
2013     1305
2014      352
2015       88
2016       27
2017       50
2018       36
Name: id, dtype: int64
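
The yearly figures above add up to the number of rows in `left_join`, not to the 30000 recipes, because a recipe with several reviews occupies several rows of the joined table. The sketch below, on a made-up joined frame, contrasts `count()` with `nunique()`.

```python
import pandas as pd

# Hypothetical joined table: recipe 1 has three reviews, recipes 2 and 3 one row each
joined_demo = pd.DataFrame({
    "id": [1, 1, 1, 2, 3],
    "year": [2002, 2002, 2002, 2002, 2003],
})

rows_per_year = joined_demo.groupby("year")["id"].count()       # joined rows
recipes_per_year = joined_demo.groupby("year")["id"].nunique()  # distinct recipes

print(rows_per_year.to_dict())     # {2002: 4, 2003: 1}
print(recipes_per_year.to_dict())  # {2002: 2, 2003: 1}
```

Grouping the `recipes` table itself by submission year would give the same per-recipe counts without involving the reviews.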

### Merging `pd.DataFrame` tables

5.1 Using a table merge, create a `DataFrame` with four columns: `id`, `name`, `user_id`, `rating`. Recipes that have no reviews must be absent from the resulting table. Verify that your code works correctly by picking a recipe without reviews and trying to find the row for that recipe in the resulting `DataFrame`.

In [16]:
df_51 = recipes[['id','name']].merge(
    reviews[['user_id','rating','recipe_id']],
    how='left',
    left_on='id',
    right_on='recipe_id'
).drop(columns='recipe_id')
without_nan = df_51.dropna(subset=['rating'])
without_nan

Unnamed: 0,id,name,user_id,rating
0,44123,george s at the cove black bean soup,743566.0,5.0
1,44123,george s at the cove black bean soup,76503.0,5.0
2,44123,george s at the cove black bean soup,34206.0,5.0
3,67664,healthy for them yogurt popsicles,494084.0,5.0
4,67664,healthy for them yogurt popsicles,303445.0,5.0
...,...,...,...,...
128591,486161,zydeco soup,305531.0,5.0
128592,486161,zydeco soup,1271506.0,5.0
128593,486161,zydeco soup,724631.0,5.0
128594,486161,zydeco soup,133174.0,5.0


In [17]:
without_nan[without_nan['id']==1144]

Unnamed: 0,id,name,user_id,rating


In [18]:
df_51[df_51['id']==1144]

Unnamed: 0,id,name,user_id,rating
109306,1144,steak tomato basil pasta,,
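
An inner join is another way to arrive at the same table: recipes without a matching review never enter the result, so no `dropna` step is needed and `user_id` and `rating` keep their integer dtype. The sketch uses made-up miniature frames built from ids that appear in the outputs above.

```python
import pandas as pd

# Hypothetical miniature tables; recipe 1144 has no reviews
recipes_demo = pd.DataFrame({
    "id": [44123, 1144],
    "name": ["george s at the cove black bean soup", "steak tomato basil pasta"],
})
reviews_demo = pd.DataFrame({
    "user_id": [743566, 76503],
    "recipe_id": [44123, 44123],
    "rating": [5, 5],
})

# how="inner" keeps only recipes that have at least one review
rated_demo = recipes_demo.merge(
    reviews_demo, how="inner", left_on="id", right_on="recipe_id"
)[["id", "name", "user_id", "rating"]]

print(rated_demo)
print(rated_demo[rated_demo["id"] == 1144].empty)  # True
```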


5.2 Using table merges and grouping, create a `DataFrame` with three columns: `recipe_id`, `name`, `review_count`, where `review_count` holds the number of reviews left for recipe `recipe_id`. For recipes that have no reviews, `review_count` must be 0. Verify that your code works correctly by picking a recipe without reviews and finding the row for that recipe in the resulting `DataFrame`.

In [37]:
grouped52 = df_51.drop(columns='user_id').groupby(['id','name']).count().reset_index()
grouped52.columns = ['recipe_id', 'name', 'review_count']
grouped52

Unnamed: 0,recipe_id,name,review_count
0,48,boston cream pie,2
1,55,betty crocker s southwestern guacamole dip,4
2,66,black coffee barbecue sauce,18
3,91,brown rice and vegetable pilaf,4
4,94,blueberry buttertarts,4
...,...,...,...
29995,536547,cauliflower ceviche,1
29996,536610,miracle home made puff pastry,1
29997,536728,gluten free vegemite,1
29998,536729,creole watermelon feta salad,4


In [38]:
grouped52[grouped52['recipe_id']==1144]

Unnamed: 0,recipe_id,name,review_count
61,1144,steak tomato basil pasta,0


5.3 Find out which submission year's recipes have the lowest average rating.

In [45]:
grouped53 = left_join[['year', 'rating']].groupby('year')
grouped53.mean().idxmin()

rating    2017
dtype: int64

### Saving `pd.DataFrame` tables

6.1 Sort the table in descending order of the `name_word_count` column and save the results of tasks 3.1-3.3 to a CSV file.

In [75]:
df_61 = recipes.sort_values(by='name_word_count', ascending=False)
df_61

Unnamed: 0,name,id,minutes,contributor_id,submitted,n_steps,description,n_ingredients,description_length,name_word_count
26223,Subru uncle s whole green moong dal i ll be ma...,77188,95,6357,2003-11-21,,my dad and mom quite enjoy this lentil curry. ...,15.0,343.0,15
28083,Tsr version of t g i friday s black bean soup...,102274,75,74652,2004-10-19,9.0,from www.topsecretrecipes.com i got this copyc...,16.0,436.0,14
26222,Subru uncle s toor ki dal sindhi style dad m...,76908,65,6357,2003-11-18,29.0,this is the lentil curry that subru uncle(our ...,15.0,1087.0,14
27876,Top secret recipes version of i h o p griddl...,113346,20,175727,2005-03-14,5.0,this recipe is top secret recipes version of i...,9.0,129.0,14
5734,Chicken curry or cat s vomit on a bed of magg...,294898,30,802799,2008-03-28,11.0,an old family recipe that's easy to make since...,12.0,144.0,13
...,...,...,...,...,...,...,...,...,...,...
3253,Blackmoons,323195,430,415934,2008-09-04,5.0,my mom was a newlywed in the 1950s when she fo...,,389.0,1
4138,Bushwhacker,156521,10,177392,2006-02-17,1.0,this drink is an excellent after dinner drink ...,6.0,124.0,1
2357,Basbousa,12957,60,18391,2001-10-20,,this is a traditional middle eastern dessert. ...,,78.0,1
15052,Josefinas,264859,20,498271,2007-11-11,7.0,"from the junior league of corpus christi tx, t...",,92.0,1


In [76]:
df_61.to_csv('df61.csv')

6.2 Using `pd.ExcelWriter`, save the results of 5.1 and 5.2 to a file: write the results of 5.1 to a sheet named `Рецепты с оценками` and the results of 5.2 to a sheet named `Количество отзывов по рецептам`.

In [77]:
with pd.ExcelWriter("Excel51_52.xlsx") as writer:
    without_nan.to_excel(writer, sheet_name="Рецепты с оценками")  
    grouped52.to_excel(writer, sheet_name="Количество отзывов по рецептам")  

#### [version 2]
* Clarified the wording of tasks 1.1, 3.3, 4.2, 5.1, 5.2, 5.3