# Exploratory data analysis

In this lesson we'll look at how to:
- build pivot tables
- slice data with `query()`
- create graphs
- work with dates and times

Load the libraries we already know and a new dataset (https://www.kaggle.com/datasets/gregorut/videogamesales?resource=download):

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('./datasets/vgsales.csv')

In [None]:
df.head()

# Pivot tables for calculation

The method for building pivot tables is `pivot_table()`. Recall its parameters:
- `index` - a column whose values become <font color="#2e0a8f">row</font> names (index);
- `columns` - a column whose values become the names of the <font color="#f00505">columns</font>;
- `values` - the columns whose <font color="#29ab33">values</font> are aggregated in the pivot table;
- `aggfunc` - the function applied to the values.

![jupyter](./pict/pivott.png)

Different aggregation functions can be used. Many of them are provided by the `numpy` library, which is new to us:

In [None]:
import numpy as np

In [None]:
df.head()

Let's try out the most popular functions:

In [None]:
# Several aggregations for NA_Sales, a single one for Global_Sales
table = pd.pivot_table(df,
                       index=['Genre'],
                       values=['NA_Sales', 'Global_Sales'],
                       aggfunc={'Global_Sales': np.sum,
                                'NA_Sales': [np.sum, np.mean, len, np.min, np.max, 'count', np.std]},
                       fill_value=0)
table

As you noticed, pivot tables are built much as in Excel: we choose `index` and `columns` and specify the aggregation functions for `values`.

If several aggregations are needed for one column, they are passed as a list.

If a single aggregation is used for all columns, the notation is simpler:

In [None]:
table = pd.pivot_table(df, 
                       index=['Genre'], 
                       values=['NA_Sales','EU_Sales','JP_Sales','Other_Sales'],
                       aggfunc=np.sum,
                       fill_value=0)
table

In [None]:
df_table = table.reset_index()

In [None]:
df_table
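
To see what `pivot_table()` computes without loading the full dataset, here is a minimal sketch on a hand-made table. The three rows and the names `toy`, `totals`, `na_stats`, and `flat` are invented for illustration; only the column names mirror the video game dataset. The aggregations are given as strings (`'sum'`, `'mean'`), which pandas accepts alongside NumPy functions.

```python
import pandas as pd

# Hypothetical toy data: column names mirror vgsales.csv, the values are made up
toy = pd.DataFrame({
    'Genre': ['Sports', 'Sports', 'Racing'],
    'NA_Sales': [1.0, 3.0, 2.0],
    'EU_Sales': [0.5, 0.5, 1.0],
})

# One aggregation applied to every column listed in `values`
totals = toy.pivot_table(index='Genre', values=['NA_Sales', 'EU_Sales'], aggfunc='sum')

# Several aggregations for one column: pass them as a list inside a dict
na_stats = toy.pivot_table(index='Genre', values=['NA_Sales'],
                           aggfunc={'NA_Sales': ['sum', 'mean']})

# reset_index() turns the Genre index back into an ordinary column
flat = totals.reset_index()

print(totals)
print(na_stats)
print(flat.columns.tolist())
```

With a list of aggregations the resulting columns form two levels (column name, function name), so a single cell is addressed as `na_stats.loc['Sports', ('NA_Sales', 'mean')]`.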

# Histogram

A histogram is a graph that shows how often a particular value occurs in a dataset. The histogram combines numerical values into ranges, that is, it counts the frequency of values within each interval.

Let's compare a histogram with the result of the `value_counts()` method, using the Simpsons data as an example:

In [None]:
simpsons = pd.read_csv('./datasets/SimpsonsData.csv')
simpsons.head()

The dataset contains information about The Simpsons episodes: title, release date, season, season number, and rating. Should the authors and producers of the series prepare a new season?
Help the team make a decision: look at the ratings that past episodes received.

Using the `value_counts()` method, you can display the number of episodes that received a particular rating:

In [None]:
simpsons['Rating'].value_counts().head(10)

A table like this is unlikely to convince the producers to continue working on The Simpsons. To see which ratings the show received most often, you can build a histogram:

In [None]:
simpsons['Rating'].hist()

Most often, episodes received ratings in the region of 7 points or higher. This is a very good indicator for the series.

In pandas, a histogram is built with the `hist()` method. It can be applied to a Series or to a dataframe; in the second case, the column name can be passed in the `column` parameter. The `hist()` method finds the minimum and maximum values in a set of numbers and divides the resulting range into intervals, or bins. `hist()` then counts how many values fall into each bin and plots the result.

The `bins` parameter determines how many intervals the data range is divided into. By default there are 10 bins.

Let's build a histogram showing the number of balls in a bowling alley. Let's say we have one ball of each number from 6 to 16:

In [None]:
pd.Series([6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]).hist()

Although there is one ball for each number, the histogram is not flat. This is because the default is `bins=10` while there are 11 distinct numbers, so two of them share one bin. Let's pass the matching number of bins and look at the resulting graph:

In [None]:
pd.Series([6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]).hist(bins=11) 
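
The effect of `bins` can also be checked numerically. The sketch below uses `np.histogram`, on the assumption that it bins values the same way the plotted histogram does; the variable names are illustrative.

```python
import numpy as np

balls = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]

# 10 bins over [6, 16]: each bin is 1 wide, and the last bin is closed on both
# sides, so it catches both 15 and 16
counts_10, edges_10 = np.histogram(balls, bins=10)

# 11 bins: every ball number falls into its own bin
counts_11, edges_11 = np.histogram(balls, bins=11)

print(counts_10)
print(edges_10)
print(counts_11)
```

The counts show why the first plot has one taller bar: with ten bins, two ball numbers end up in the last interval.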

More examples:

In [None]:
pd.Series([6, 8, 8, 9, 10, 11, 12, 13, 14, 15, 16]).hist(bins=6) 

In [None]:
pd.Series([6, 8, 8, 9, 10, 11, 12, 13, 14, 100]).hist()

There are nine balls with values in the range from 5 to 15 and one ball with a number from 90 to 100. In this picture the finer features of the distribution between 5 and 15 are not visible: there is no seven, but there are two eights.
Let's bring back the detail by increasing the number of bins to 100.

In [None]:
pd.Series([6, 8, 8, 9, 10, 11, 12, 13, 14, 100]).hist(bins=100) 

Let's change the scale manually by specifying the range of values over which the graph should be built. The boundaries of the interval of interest are given in the `range` parameter: `range=(min_value, max_value)`. We need the interval from 6 to 14:

In [None]:
pd.Series([6, 8, 8, 9, 10, 11, 12, 13, 14, 100]).hist(range=(6, 14))

In [None]:
pd.Series([0, 0, 0, 0, 0, 10, 10, 10, 10, 10]).hist(range=(0, 10)) 

In [None]:
pd.Series([4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6]).hist(range=(0, 10)) 

Judging only by the mean of the two datasets above (it equals 5 in both), they look very similar. However, their histograms make it clear that these are two very different phenomena.
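
A quick numerical check of this point, as a minimal sketch: the two series are the ones plotted above, while the names `polarized` and `clustered` are introduced here only for illustration. The mean is the same for both, but the standard deviation differs sharply.

```python
import pandas as pd

polarized = pd.Series([0, 0, 0, 0, 0, 10, 10, 10, 10, 10])
clustered = pd.Series([4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6])

# Same mean, very different spread
for name, s in [('polarized', polarized), ('clustered', clustered)]:
    print(name, 'mean =', s.mean(), 'std =', round(s.std(), 2))
```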

# New experiment

Let's try to solve a problem: we have two dice and add up the points on them. What values will we get? What will the results look like if we roll the dice 1000 times?

![jupyter](./pict/cub.png)

Let's get acquainted with the new `random` library:

In [None]:
import random 

The `random.randint()` function returns a random integer. It takes two arguments: the smallest and the largest allowed number, both inclusive.

In [None]:
print(random.randint(1, 6))

Let's write a function that returns a random number of points on the top face:

In [None]:
def dice_roll():
    score = random.randint(1, 6)  
    return score

We have a pair of dice. So, we need a function that returns the number of points from a roll of two dice:

In [None]:
def double_roll_score():
    first = dice_roll()
    second = dice_roll()
    score = first + second
    return score

Let's make 1000 such throws and build a histogram of the points received:

In [None]:
experiments = []
for i in range(1000):
    score = double_roll_score()
    experiments.append(score)

df_experiments = pd.DataFrame({'cube': experiments})
ar = df_experiments.hist(bins=11, range=(2, 12))

In [None]:
df_experiments
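
For comparison with the simulated histogram, the exact distribution of the sum of two dice can be counted directly. This is a small sketch using only the standard library; the name `outcomes` is illustrative.

```python
from collections import Counter
from itertools import product

# All 36 equally likely outcomes of two six-sided dice, grouped by their sum
outcomes = Counter(a + b for a, b in product(range(1, 7), repeat=2))

for total in sorted(outcomes):
    # sum, number of ways to get it, probability
    print(total, outcomes[total], f'{outcomes[total] / 36:.3f}')
```

A sum of 7 can be obtained in six ways and sums of 2 or 12 in only one way each, which is why the middle bars of the histogram are the tallest.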

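Does such a histogram describe only dice rolls, or is it characteristic of other phenomena as well?

The most common (typical, normal) values fall in the middle, and rare ones at the edges. The graph is symmetrical and resembles a bell. Such a distribution is called normal. Strictly speaking, the sum of two dice gives a triangular distribution, which only approximates the bell shape; the more dice are summed, the closer the result gets to normal.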

![jupyter](./pict/normal.png)

Normal distributions, or distributions close to normal, are common in life. This is how the height of people, the size of apples, and the results of temperature measurements are distributed. Understanding the nature of distributions is necessary to detect important anomalies.

In general, a deviation from the expected distribution is a signal that something may be wrong with the data.
Another distribution that is often encountered is the Poisson distribution. It describes the number of events per unit of time.

When describing a distribution, analysts calculate the arithmetic `mean` or the `median`. However, in addition to the `median` and `mean`, it is important to know the characteristic spread: which values are far from the mean and how many of them there are.

A measure of spread that is robust to outliers is the interquartile range.
Quartiles break an ordered data set into four parts. The first quartile `Q1` is the number separating the first quarter of the sample: 25% of the elements are less than it, and 75% are greater. The median is the second quartile `Q2`: half of the elements are less than it and half are greater. The third quartile `Q3` cuts off three quarters: 75% of the elements are less than it and 25% are greater. The interquartile range is the distance between `Q1` and `Q3`.

![jupyter](./pict/quart.png)
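
The quartiles described above can be computed with the `quantile()` method. A minimal sketch on a hand-made series (the data and variable names are illustrative):

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1 = values.quantile(0.25)
q2 = values.quantile(0.50)  # the same number as values.median()
q3 = values.quantile(0.75)
iqr = q3 - q1               # interquartile range

print(q1, q2, q3, iqr)
```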

In pandas, a box-and-whisker plot is built with the `boxplot()` method:

In [None]:
ar = df_experiments.boxplot()

Let's add a couple of values to the dataframe that are obvious outliers:

In [None]:
df_new_cube = pd.DataFrame({'cube': [100,121]})

df_experiments = pd.concat([df_experiments,df_new_cube])

In [None]:
ar = df_experiments.boxplot()

For advanced work with graphs (including histograms), import the `matplotlib` library.

Let's use the `ylim(y_min, y_max)` function to change the scale along the vertical axis. To change the scale along the horizontal axis, call `xlim(x_min, x_max)`. Both functions take two parameters: the minimum and maximum values to show on the plot. They are called from the `matplotlib.pyplot` module.

In [None]:
import matplotlib.pyplot as plt 

plt.ylim(0, 14)

ar = df_experiments.boxplot()

# Data slices

We have already looked at ways of slicing dataframe data; now let's consider one more.

Let's go back to the video game sales task:

In [None]:
df.head()

Let's set ourselves several tasks and select the rows:
- where Genre is Sports
- where EU_Sales is greater than NA_Sales
- where Year is 2005, 2010, or 2015
- where Year is not 2019
- where Year is neither 2005 nor 2007 and Genre is Simulation or Racing

In [None]:
df.loc[df['Genre']=='Sports'].head()

In [None]:
df.loc[df['NA_Sales']<df['EU_Sales']].head()

To check for specific values in a column, call the `isin()` method

In [None]:
df.loc[df['Year'].isin([2005,2010,2015])].head()

The `~` symbol negates a condition: the result is True where the condition is False.

In [None]:
df.loc[~df['Year'].isin([2019])].head()

**AND** (`&`): the result is True only if both conditions are True.

**OR** (`|`): the result is True if at least one of the conditions is True.

In [None]:
df.loc[(df['Genre']=='Racing') & (~df['Year'].isin([2005,2007]))]

In [None]:
df.loc[((df['Genre']=='Racing') | (df['Genre']=='Simulation')) & (~df['Year'].isin([2005,2007]))].head()
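
The way `&`, `|`, and `~` combine can be traced on a tiny table. This is a sketch with made-up rows; `games`, `is_racing_or_sim`, and `bad_year` are illustrative names, not part of the dataset.

```python
import pandas as pd

# Hypothetical mini-table with the same column names as the video game data
games = pd.DataFrame({
    'Genre': ['Racing', 'Simulation', 'Sports', 'Racing'],
    'Year': [2005, 2010, 2010, 2012],
})

# Each condition is a boolean Series with one True/False per row
is_racing_or_sim = (games['Genre'] == 'Racing') | (games['Genre'] == 'Simulation')
bad_year = games['Year'].isin([2005, 2007])

# & keeps rows where both masks are True; ~ flips a mask
selected = games.loc[is_racing_or_sim & ~bad_year]
print(selected)
```

When comparisons are written inline, each one needs its own parentheses, because `&` and `|` bind more tightly than `==` or `<`.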

# Slicing data using the query() method

The condition for the slice is written as a string that is passed as an argument to the `query()` method. The method is applied to the dataframe and returns the desired slice.

Conditions passed to `query()` (they are very similar to SQL):
- They support the comparison operations `==`, `!=`, `>`, `>=`, `<`, `<=`.
- They check whether values are in a list using the construction `Year in [2019, 2018]`. To check that values are not in the list, write `Year not in [2019, 2018]`.
- They work with logical operators in the usual way, where “or” is `or`, “and” is `and`, “not” is `not`. Parentheses are optional. Without parentheses, operations are performed in the following order: first `not`, then `and`, and finally `or`.

Let's rewrite the last slice condition using the `query()` method:

In [None]:
# df.loc[((df['Genre']=='Racing') | (df['Genre']=='Simulation')) & (~df['Year'].isin([2005,2007]))].head()

df.query('''(Genre == "Racing"
         or Genre == "Simulation")
         and Year not in [2005, 2007]''').head()
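
That the mask-based slice and the `query()` string select the same rows can be verified on a small example. The table below is made up, and the names `games`, `by_mask`, `by_query`, and `years` are illustrative.

```python
import pandas as pd

# Hypothetical mini-table with the same column names as the video game data
games = pd.DataFrame({
    'Genre': ['Racing', 'Simulation', 'Sports', 'Racing'],
    'Year': [2005, 2010, 2010, 2012],
})

by_mask = games.loc[((games['Genre'] == 'Racing') | (games['Genre'] == 'Simulation'))
                    & (~games['Year'].isin([2005, 2007]))]

by_query = games.query('(Genre == "Racing" or Genre == "Simulation") '
                       'and Year not in [2005, 2007]')

# The same condition with the list taken from an external variable via @
years = [2005, 2007]
by_query_var = games.query('Genre in ["Racing", "Simulation"] and Year not in @years')

print(by_query.equals(by_mask), by_query_var.equals(by_mask))
```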

# Query() capabilities

In addition to combining conditions, you can perform mathematical operations in `query()`:

In [None]:
df.query('NA_Sales < 2 * EU_Sales ').head()

And even call methods:

In [None]:
df.query('NA_Sales < EU_Sales.mean()').head()

You can also include external variables (not from the dataframe) in `query()`. When you mention such a variable, mark it with an `@` sign:

In [None]:
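# Illustrative values: a scalar threshold and a list of values to exclude
threshold = 1
sales = [123, 213, 123]

df.query('NA_Sales > @threshold and EU_Sales not in @sales').head()

Another way to pass variables into a query is an f-string:

In [None]:
df.query(f'NA_Sales > {threshold} and EU_Sales > {threshold}').head()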

In [None]:
# best-selling games by region

# highest ratings by platform

# most profitable years

# Working with date and time

Load the data recorded at the ticket point:

In [None]:
df = pd.read_excel('./datasets/data_lesson_date.xlsx')

In [None]:
df

The `date_time` column contains the entry date and time. From the description of the data, it is known that the entry time is given in the UTC+0 time zone, in ISO format. This means that the year, month, and day come first, written together; then the letter `T`, which separates the date from the time; then the hours, minutes, and seconds, again written together.

Use the `to_datetime()` method, which converts strings to dates. The `format` argument of `to_datetime()` takes special symbols whose order corresponds to the order of the numbers in the date string:
- `%d` - day of the month (from 01 to 31);
- `%m` - month number (from 01 to 12);
- `%Y` - four-digit year number (for example, 2019);
- `%y` - two-digit year number (for example, 19);
- `T` - the standard separator between date and time (a trailing `Z` in ISO strings stands for UTC);
- `%H` - hour number in 24-hour format;
- `%I` - hour number in 12-hour format;
- `%M` - minutes (from 00 to 59);
- `%S` - seconds (from 00 to 59).

When datetime values are displayed, pandas automatically separates their parts with `-` and `:` characters to make the data easier to read.

In [None]:
# The format follows the description above: date and time written together, separated by T
df['date_time_normal'] = pd.to_datetime(df['date_time'], format='%Y%m%dT%H%M%S')

In [None]:
df

In [None]:
df.info()
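
The format codes can be tried on a couple of hand-written strings. This is a sketch that assumes the compact layout described above (date and time written together with `T` between them); the sample strings and the names `raw` and `parsed` are invented.

```python
import pandas as pd

# Hypothetical sample strings in the compact ISO layout
raw = pd.Series(['20190405T165358', '20191231T000000'])

# The order of the codes matches the order of the parts in the string
parsed = pd.to_datetime(raw, format='%Y%m%dT%H%M%S')

print(parsed)
print(parsed.dtype)
```

If the format string does not match the data, `to_datetime()` raises an error instead of silently guessing, which makes mismatches easy to spot.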

Pandas has to be told separately that operations are to be performed on dates; this is done through the `dt` accessor (date time). The `dt` accessor indicates that the methods are applied to datetime values, so pandas will not treat them as strings or numbers.
To round the time, use the `dt.round()` method. It takes a string with the rounding step in days, hours, minutes, or seconds:
- `D` - day
- `H` - hour (`h` in pandas 2.2 and later, where the uppercase alias is deprecated)
- `min` or `T` - minute (`T` is deprecated in pandas 2.2 and later)
- `S` - second (`s` in pandas 2.2 and later)

`dt.round()` rounds to the nearest value, which is not always up. A quarter past six, rounded by `dt.round()`, becomes six o'clock:

In [None]:
df['time_rounded'] = df['date_time_normal'].dt.round('1H')
df

To make sure the time is rounded up, call the `dt.ceil()` method. To round down, use the `dt.floor()` method.

In [None]:
df['time_rounded_ceil'] = df['date_time_normal'].dt.ceil('1H')
df['time_rounded_floor'] = df['date_time_normal'].dt.floor('1D')
df
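
The difference between the three rounding methods is easiest to see side by side. A minimal sketch with two invented timestamps; the lowercase `'h'` alias is used here, on the assumption that it is accepted by the installed pandas version.

```python
import pandas as pd

# Hypothetical timestamps: a quarter past and a quarter to the hour
times = pd.Series(pd.to_datetime(['2019-04-05 18:15:00', '2019-04-05 18:45:00']))

rounded = pd.DataFrame({
    'original': times,
    'round': times.dt.round('h'),  # nearest hour
    'ceil': times.dt.ceil('h'),    # always up
    'floor': times.dt.floor('h'),  # always down
})
print(rounded)
```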

The day of the week is available through the `dt.weekday` attribute. Monday is day number 0, and Sunday is day number 6.

In [None]:
df['weekday'] = df['date_time_normal'].dt.weekday
df

Sometimes you need to convert the time to another time zone. `pd.Timedelta()` is responsible for time shifts. The number of hours is passed in the `hours` parameter.
Let's subtract 7 hours from Riga time and find out what time it was in New York when the dataframe events took place in Riga:

In [None]:
df['date_time_new_york'] = df['date_time_normal'] + pd.Timedelta(hours=-7)
df
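
A shift by `pd.Timedelta` can move an event to a different calendar day, which also changes its weekday. The sketch below uses one invented timestamp and takes the fixed 7-hour difference from the text as an assumption; the names `riga` and `new_york` are illustrative.

```python
import pandas as pd

# Hypothetical event at 03:00 Riga time on a Friday
riga = pd.Series(pd.to_datetime(['2019-04-05 03:00:00']))

# Fixed 7-hour shift, as assumed in the text above
new_york = riga + pd.Timedelta(hours=-7)

print(new_york[0])
print(riga.dt.weekday[0], new_york.dt.weekday[0])
```

Early-morning events in Riga fall on the previous evening in New York, so any grouping by day or weekday should be done after the shift.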

# Graphs

The `plot()` method is responsible for plotting graphs in pandas. Here is a simple example:

In [None]:
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'b': [4, 9, 16, 25, 18, 22, 27, 30, 31, 33]})
df.plot() 

The `plot()` method plotted graphs based on the column values of the dataframe. The indexes are on the abscissa (x) axis, and the column values are on the ordinate (y) axis.
The title of the graph is passed as a string or a variable in the `title` parameter:

In [None]:
gr = df.plot(title='A and B')

Let's make the graph more precise: pass the `style` parameter with the value `'o-'` to mark the table values with dots.

In [None]:
# style='x'
# style='o'

gr = df.plot(style='o-')

Recall that the indices are plotted along the horizontal axis. But what if this way of representing is not suitable for analysis? You can change the indices themselves or pass the parameters of the axes to the `plot()` method. So, the abscissa axis (x) will be assigned the values of column b, and the ordinate axis (y) - the values of column a:

In [None]:
gr = df.plot(x='b', y='a', style='o-') 

![jupyter](./pict/graf_c.png)

You can also set the axis limits with the `xlim` and `ylim` parameters.

Let's add grid lines: they make it easier to read the displayed values. Set the `grid` parameter to `True` (this means the grid should be displayed):

In [None]:
gr = df.plot(x='b', y='a', style='x-', xlim=(0, 40), grid=True) 

The size of the chart is controlled through the figsize parameter. The `width` and `height` of the construction area in inches are passed to the parameter in brackets: `figsize = (x_size, y_size)`. Let's compare graphs with different sizes:

In [None]:
gr = df.plot(x='b', y='a', style='o-', xlim=(0, 40), grid=True, figsize=(4,3)) 

#figsize
# 5:4
# 4:3
# 3:2
# 16:10
# 5:3 figsize=(10, 6)
# 16:9
# 64:27
# 43:18
# 32:9 

# Grouping with pivot_table()

Let's go back to our video games and say a little about **OCCAM'S RAZOR**.

In [None]:
df = pd.read_csv('./datasets/vgsales.csv')

Entia non sunt multiplicanda praeter necessitatem: "Entities should not be multiplied unnecessarily." The essence of the principle: perfection should be simple.

![jupyter](./pict/python.png)

We will not multiply entities unnecessarily, and from now on it is better to get rid of intermediate variables that we will not reuse. Let's apply the `plot()` method directly to the result of `query()`, without storing the slice in a variable. You get a structure like this:

`data.query().plot()`

Let's pass the required parameters. To make the code easier to read, let's split it over several lines.

For example, let's take sales in North America by year:

In [None]:
(df
    .query('Genre == "Sports"')
    .plot(x='Year', y='NA_Sales', 
          style='o-', grid=True, figsize=(12, 6))
)

The graph looks terrible without explicit aggregation. Let's use the methods we already know.

Let's turn to `pivot_table()` and add a pivot table to the chain between `query()` and `plot()`:

In [None]:
(df
    .query('Genre == "Sports"')
    .pivot_table(index='Year', values='NA_Sales', aggfunc=np.sum)
    .plot(grid=True, style='o-', figsize=(12, 5))
) 

In [None]:
(df
    .query('Genre == "Sports"')
    .pivot_table(index='Year', values='NA_Sales', aggfunc=[np.sum, np.max])
    .plot(grid=True, style='o-', figsize=(12, 5))
) 

# More charts

`seaborn` is another powerful charting library with flexible styling settings:

In [None]:
import seaborn as sns

In [None]:
# sns.set() resets the style to its defaults, so call it before set_style()
sns.set(rc={'figure.figsize': (16, 9)})
sns.set_style("darkgrid", {"grid.color": ".6", "grid.linestyle": ":"})

ar = sns.lineplot(data=df.pivot_table(index=['Year','Genre'], values='NA_Sales', aggfunc=np.sum), 
                  x="Year", y="NA_Sales", hue="Genre")

In [None]:
ar = sns.barplot(data=df.query('Year > 2012'), x='Genre', y='NA_Sales', hue='Year')

If you want interactive charts, take a look at these libraries:

In [None]:
import plotly as py
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
# Using plotly + cufflinks in offline mode
import cufflinks
cufflinks.go_offline(connected=True)
init_notebook_mode(connected=True)

In [None]:
df['Genre'].iplot(kind='hist', xTitle='Genre released',
                  yTitle='count', title='VG Stat')

If we want to plot box plots for every genre, it's just as easy:

In [None]:
df.pivot_table(columns='Genre', index='Year', values='Global_Sales').iplot(
        kind='box',
        yTitle='Sales',
        title='Global Sales by Genre')