## Pandas

The two main data structures are DataFrame and Series. (https://khashtamov.com/ru/pandas-introduction/)

### Series 

A Series is an object similar to a one-dimensional array, but every element has an associated label (its index). This makes a Series behave like an associative array, much like a dictionary in Python.
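To make the dictionary analogy concrete, here is a minimal sketch. The labels and values are invented for illustration and do not come from the notebook.

```python
import pandas as pd

# Hypothetical data: labels and values are made up for illustration.
prices = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Look up a value by its label, as with a dict key.
b_price = prices['b']

# Unlike a dict, a Series also supports access by position.
first_price = prices.iloc[0]

# Convert to a real dict when one is needed.
as_dict = prices.to_dict()
```
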

In [2]:
import pandas as pd

series = pd.Series([[1,2,3],[4,5,6]])
print(series)

0    [1, 2, 3]
1    [4, 5, 6]
dtype: object


In [3]:
series.index

RangeIndex(start=0, stop=2, step=1)

In [4]:
series.values

array([list([1, 2, 3]), list([4, 5, 6])], dtype=object)

In [5]:
series[0]

[1, 2, 3]

In [13]:
series1 = pd.Series([1,2,3,4,5], index = range(0,5))
print(series1)

0    1
1    2
2    3
3    4
4    5
dtype: int64


Selecting several index labels at once and assigning to them as a group

In [14]:
series1[[1, 2, 3]]

1    2
2    3
3    4
dtype: int64

In [15]:
series1[[1, 2, 3]] = 0
print(series1)

0    1
1    0
2    0
3    0
4    5
dtype: int64


Filtering data

In [16]:
series1[series1 > 2]

4    5
dtype: int64
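The filter above works through a boolean mask: the comparison produces one True/False per element, and indexing with that mask keeps only the True rows. A small sketch, assuming the same values that series1 holds at this point (the variable names here are illustrative):

```python
import pandas as pd

s = pd.Series([1, 0, 0, 0, 5])  # same values as series1 after the assignment

mask = s > 2          # boolean Series, one True/False per element
filtered = s[mask]    # keeps only the elements where mask is True

# Conditions combine with & (and) and | (or); each one needs parentheses.
between = s[(s >= 1) & (s < 5)]
```
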

Creating a Series from a dictionary

In [17]:
series2 = pd.Series({0: 1, 1: 2})
print(series2)

0    1
1    2
dtype: int64


In [20]:
2 in series2  # checks the index labels (keys), not the values

False
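Because `in` looks at index labels, checking for a stored value needs a different expression. One possible way, sketched on the same two-element Series:

```python
import pandas as pd

s = pd.Series({0: 1, 1: 2})

label_present = 2 in s          # tests index labels: 2 is not a label
value_present = 2 in s.values   # tests the stored values instead
per_element = s.isin([2])       # element-wise membership, returns a boolean Series
```
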

Both the Series and its index can be given a name through the name attribute

In [22]:
series2.name = 'objects'
series2.index.name = 'indexies'
series2

indexies
0    1
1    2
Name: objects, dtype: int64

### DataFrame

A DataFrame can be described as an ordinary table in which all columns have the same length

In [24]:
dataframe = pd.DataFrame({
    'languages': ['English', 'Russian'],
    'age': [18, 19]
})

print(dataframe)

  languages  age
0   English   18
1   Russian   19


A column of a DataFrame is a Series

In [25]:
dataframe['languages']

0    English
1    Russian
Name: languages, dtype: object
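A quick way to confirm this, and to see how single and double brackets differ, is sketched below. It rebuilds the same small frame so that it runs on its own.

```python
import pandas as pd

dataframe = pd.DataFrame({
    'languages': ['English', 'Russian'],
    'age': [18, 19],
})

col = dataframe['languages']     # single brackets: a Series
sub = dataframe[['languages']]   # double brackets: a one-column DataFrame

col_is_series = isinstance(col, pd.Series)
sub_is_frame = isinstance(sub, pd.DataFrame)
```
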

In [26]:
dataframe.columns

Index(['languages', 'age'], dtype='object')

In [27]:
dataframe.index

RangeIndex(start=0, stop=2, step=1)

## Accessing data

The row index can be set in different ways: when the DataFrame is created, or afterwards by assigning to its index attribute

In [34]:
df = pd.DataFrame({
    'age': [12, 13, 15, 16],
    'number': [122, 555, 232, 444]
}, index = [1,2,3,4])

df

|   | age | number |
|---|-----|--------|
| 1 | 12 | 122 |
| 2 | 13 | 555 |
| 3 | 15 | 232 |
| 4 | 16 | 444 |


In [36]:
df.index = [4,5,6,7]
df

|   | age | number |
|---|-----|--------|
| 4 | 12 | 122 |
| 5 | 13 | 555 |
| 6 | 15 | 232 |
| 7 | 16 | 444 |


In [38]:
df.index.name = 'indexies'
df

| indexies | age | number |
|----------|-----|--------|
| 4 | 12 | 122 |
| 5 | 13 | 555 |
| 6 | 15 | 232 |
| 7 | 16 | 444 |


Rows can be accessed through the index in several ways:

.loc - access by index label (labels may be strings or numbers)

.iloc - access by integer position (starting from 0)

In [39]:
df.loc[4]

age        12
number    122
Name: 4, dtype: int64

In [41]:
df.iloc[0]

age        12
number    122
Name: 4, dtype: int64

In [43]:
df.loc[[4,5], 'age']

indexies
4    12
5    13
Name: age, dtype: int64
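The difference between the two accessors is easiest to see when the labels are integers that do not start at 0, as in this frame. The sketch below rebuilds the frame so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Same frame as above: integer labels that do not start at 0.
df = pd.DataFrame(
    {'age': [12, 13, 15, 16], 'number': [122, 555, 232, 444]},
    index=[4, 5, 6, 7],
)

by_label = df.loc[5, 'age']     # row whose label is 5
by_position = df.iloc[1, 0]     # second row, first column (the same cell)

# Slicing differs: .loc includes the end label, .iloc excludes the end position.
loc_rows = df.loc[4:6]          # labels 4, 5 and 6
iloc_rows = df.iloc[0:2]        # positions 0 and 1

# Label 0 does not exist here, so .loc[0] raises KeyError,
# while .iloc[0] would return the first row.
try:
    df.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False
```
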

Let's create another DataFrame:

In [44]:
df = pd.DataFrame({
     'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
     'population': [17.04, 143.5, 9.5, 45.5],
     'square': [2724902, 17125191, 207600, 603628]
 }, index=['KZ', 'RU', 'BY', 'UA'])

In [46]:
df[df.population > 10][['country', 'square']]

|    | country | square |
|----|---------|--------|
| KZ | Kazakhstan | 2724902 |
| RU | Russia | 17125191 |
| UA | Ukraine | 603628 |
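The chained form above filters rows first and then selects columns. The same selection can be written as a single .loc call with a boolean mask and a column list, which is often preferred when the result will later be assigned to. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Kazakhstan', 'Russia', 'Belarus', 'Ukraine'],
    'population': [17.04, 143.5, 9.5, 45.5],
    'square': [2724902, 17125191, 207600, 603628],
}, index=['KZ', 'RU', 'BY', 'UA'])

# Two steps: row filter, then column selection.
chained = df[df.population > 10][['country', 'square']]

# One step: rows and columns in a single .loc call.
single = df.loc[df['population'] > 10, ['country', 'square']]
```
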


In [47]:
df.reset_index()

|   | index | country | population | square |
|---|-------|---------|------------|--------|
| 0 | KZ | Kazakhstan | 17.04 | 2724902 |
| 1 | RU | Russia | 143.5 | 17125191 |
| 2 | BY | Belarus | 9.5 | 207600 |
| 3 | UA | Ukraine | 45.5 | 603628 |
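Note that reset_index returns a new DataFrame and moves the old index into a column named index; the original frame keeps its labels. A short sketch on a two-row subset of the data, with set_index shown as the reverse step:

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Kazakhstan', 'Russia'],
    'population': [17.04, 143.5],
}, index=['KZ', 'RU'])

flat = df.reset_index()             # new frame with a 0..n-1 index
restored = flat.set_index('index')  # move the column back into the index
```
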


Let's add one more column, population density (population is given in millions, hence the factor of 1000000)

In [48]:
df['density'] = df['population'] / df['square'] * 1000000
df

|    | country | population | square | density |
|----|---------|------------|--------|---------|
| KZ | Kazakhstan | 17.04 | 2724902 | 6.253436 |
| RU | Russia | 143.5 | 17125191 | 8.379469 |
| BY | Belarus | 9.5 | 207600 | 45.761079 |
| UA | Ukraine | 45.5 | 603628 | 75.37755 |


In [49]:
df.drop(['density'], axis = 'columns')

|    | country | population | square |
|----|---------|------------|--------|
| KZ | Kazakhstan | 17.04 | 2724902 |
| RU | Russia | 143.5 | 17125191 |
| BY | Belarus | 9.5 | 207600 |
| UA | Ukraine | 45.5 | 603628 |
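The drop call returns a new DataFrame and leaves df itself unchanged, which is why the density column is still present in the next cell. A minimal sketch of that behaviour on a two-row subset of the data:

```python
import pandas as pd

df = pd.DataFrame({
    'population': [17.04, 9.5],
    'square': [2724902, 207600],
}, index=['KZ', 'BY'])
df['density'] = df['population'] / df['square'] * 1_000_000

# drop returns a copy without the column; columns=... is equivalent
# to passing the names with axis='columns'.
trimmed = df.drop(columns=['density'])

still_in_original = 'density' in df.columns
gone_in_copy = 'density' not in trimmed.columns
```

To make the removal permanent, assign the result back, for example `df = df.drop(columns=['density'])`.
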


In [50]:
df = df.rename(columns={'country':'Country'})
df

|    | Country | population | square | density |
|----|---------|------------|--------|---------|
| KZ | Kazakhstan | 17.04 | 2724902 | 6.253436 |
| RU | Russia | 143.5 | 17125191 | 8.379469 |
| BY | Belarus | 9.5 | 207600 | 45.761079 |
| UA | Ukraine | 45.5 | 603628 | 75.37755 |


## Reading and writing files

In [56]:
df1 = pd.read_csv('/content/tmdb_5000_credits.csv', sep=',')
df1

|   | movie_id | title | cast | crew |
|---|----------|-------|------|------|
| 0 | 19995 | Avatar | [{"cast_id": 242, "character": "Jake Sully", "... | [{"credit_id": "52fe48009251416c750aca23", "de... |
| 1 | 285 | Pirates of the Caribbean: At World's End | [{"cast_id": 4, "character": "Captain Jack Spa... | [{"credit_id": "52fe4232c3a36847f800b579", "de... |
| 2 | 206647 | Spectre | [{"cast_id": 1, "character": "James Bond", "cr... | [{"credit_id": "54805967c3a36829b5002c41", "de... |
| 3 | 49026 | The Dark Knight Rises | [{"cast_id": 2, "character": "Bruce Wayne / Ba... | [{"credit_id": "52fe4781c3a36847f81398c3", "de... |
| 4 | 49529 | John Carter | [{"cast_id": 5, "character": "John Carter", "c... | [{"credit_id": "52fe479ac3a36847f813eaa3", "de... |
| ... | ... | ... | ... | ... |
| 4798 | 9367 | El Mariachi | [{"cast_id": 1, "character": "El Mariachi", "c... | [{"credit_id": "52fe44eec3a36847f80b280b", "de... |
| 4799 | 72766 | Newlyweds | [{"cast_id": 1, "character": "Buzzy", "credit_... | [{"credit_id": "52fe487dc3a368484e0fb013", "de... |
| 4800 | 231617 | Signed, Sealed, Delivered | [{"cast_id": 8, "character": "Oliver O\u2019To... | [{"credit_id": "52fe4df3c3a36847f8275ecf", "de... |
| 4801 | 126186 | Shanghai Calling | [{"cast_id": 3, "character": "Sam", "credit_id... | [{"credit_id": "52fe4ad9c3a368484e16a36b", "de... |
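Writing is the mirror image of reading and goes through to_csv. The sketch below is an illustration only: the two rows and the file name movies.csv are made up, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical rows, not taken from the TMDB file.
movies = pd.DataFrame({'movie_id': [1, 2], 'title': ['A', 'B']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'movies.csv')  # hypothetical file name
    # index=False keeps the row index out of the file; otherwise it comes
    # back as an extra unnamed column when the file is read again.
    movies.to_csv(path, index=False)
    loaded = pd.read_csv(path)
```
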


### Grouping

In [57]:
print(df1.head())

   movie_id  ...                                               crew
0     19995  ...  [{"credit_id": "52fe48009251416c750aca23", "de...
1       285  ...  [{"credit_id": "52fe4232c3a36847f800b579", "de...
2    206647  ...  [{"credit_id": "54805967c3a36829b5002c41", "de...
3     49026  ...  [{"credit_id": "52fe4781c3a36847f81398c3", "de...
4     49529  ...  [{"credit_id": "52fe479ac3a36847f813eaa3", "de...

[5 rows x 4 columns]


In [61]:
print(df1.groupby([df1.title])['movie_id'].count())

title
#Horror                       1
(500) Days of Summer          1
10 Cloverfield Lane           1
10 Days in a Madhouse         1
10 Things I Hate About You    1
                             ..
[REC]²                        1
eXistenZ                      1
xXx                           1
xXx: State of the Union       1
Æon Flux                      1
Name: movie_id, Length: 4800, dtype: int64
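In the output above almost every group has a count of 1, because titles are nearly unique. The effect of groupby is easier to see on data with repeated keys. A sketch on invented rows (the films frame is hypothetical):

```python
import pandas as pd

# Hypothetical data with repeated titles.
films = pd.DataFrame({
    'title': ['A', 'B', 'A', 'C', 'A'],
    'movie_id': [1, 2, 3, 4, 5],
})

# Passing the column name is equivalent to passing the column itself.
counts = films.groupby('title')['movie_id'].count()

# Sort so that the most frequent title comes first.
top = counts.sort_values(ascending=False)
```
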
