<a href="https://colab.research.google.com/github/Nrashani/Python_learning/blob/Assignment/Nimash_Pandas_advanced.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)


# Advanced data manipulation and visualisation with Python and Pandas


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)


# Introduction

pandas is a Python library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

Since pandas is a third-party module, we need to:
- install it first with `pip install pandas`
- import it in our notebook

In [None]:
# importing pandas
import pandas as pd

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

# Load game Dataset

The dataset comes from a Kaggle competition and has been slightly modified for the purposes of this class.

### Game description
This is a fictional Pokemon multiplayer game in which 100 players play against each other. Players can catch each other's pokemon using pokeballs. If your pokemon is caught by another player, you lose the match. The winner of the game is the last player standing.


**Here is an explanation of the columns**

boosts - Number of boost items used.

precisionCatch - Number of pokemon caught with a single ball.

heals - Number of healing items used.

Id - Player’s Id

catchStreaks - Max number of pokemon caught in a short amount of time.

catches - Number of pokemon caught.

matchDuration - Duration of match in seconds.

matchId - ID to identify match.

matchType - String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.

rankPoints - Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.

revives - Number of times this player revived teammates.

rideDistance - Total distance traveled in vehicles measured in meters.

swimDistance - Total distance traveled by swimming measured in meters.

walkDistance - Total distance traveled on foot measured in meters.

pokeballsAcquired - Number of pokeballs picked up.

winPoints - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.

groupId - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.

numGroups - Number of groups we have data for in the match.

maxPlace - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.

winPlacePerc - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
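
The `rankPoints` and `winPoints` descriptions above use placeholder values (-1 and 0) where a real value is missing. Below is a minimal sketch of how those placeholders could be turned into proper `nan` values. It uses a tiny made-up DataFrame called `toy`. Its values are illustrative and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows that only mimic the two ranking columns
toy = pd.DataFrame({
    "rankPoints": [-1, 1500, 1480, -1],
    "winPoints": [1200, 0, 0, 0],
})

# rankPoints: -1 takes the place of "None"
has_rank = toy["rankPoints"] != -1
toy["rankPoints"] = toy["rankPoints"].where(has_rank, np.nan)

# winPoints: 0 counts as "None" only where rankPoints holds a real value
fake_zero = has_rank & (toy["winPoints"] == 0)
toy["winPoints"] = toy["winPoints"].mask(fake_zero, np.nan)

print(toy)
```

Once the placeholders are real `nan` values, functions such as `isnull()` and `mean()` treat them as missing instead of as ordinary numbers.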


In [None]:
# Load the dataset into a pandas DataFrame
df = pd.read_csv('pokemon.csv')
df.head()

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Dealing with `nan` values

In pandas there are two ways to deal with `nan` values:
- drop the rows / columns that contain `nan` values
- replace each `nan` value with some other value
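
Both options can be tried on a small scale first. The sketch below uses a made-up DataFrame (`toy`, with illustrative columns `a` and `b`) rather than the game dataset, so that the effect of each call is easy to see.

```python
import numpy as np
import pandas as pd

# Made-up data: column "b" is mostly missing, column "a" has one gap
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 2.0],
})

# Option 1 (drop, column-wise): keep only columns in which
# at least 50% of the values are not missing
kept_cols = toy.dropna(thresh=int(len(toy) * 0.5), axis=1)

# Option 1 (drop, row-wise): remove every row that contains a nan
complete_rows = toy.dropna()

# Option 2 (replace): fill the gap in "a" with the column mean
filled = toy.assign(a=toy["a"].fillna(toy["a"].mean()))

print(kept_cols.columns.tolist())
print(len(complete_rows))
print(filled["a"].tolist())
```

None of these calls modifies `toy` itself. Each one returns a new DataFrame, which is why the results are stored in separate variables here.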

In [None]:
# Check for nan values
print(df.isnull().sum())

In [None]:
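# drop columns in which more than 50% of the values are missing
# (thresh is the minimum number of non-missing values a column needs to be kept)
df = df.dropna(thresh=int(df.shape[0] * 0.5), axis=1)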
print(df.isnull().sum())

**QUESTION**

Has anything changed?

Let's try to fill some columns with values

In [None]:
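# Fill walkDistance with the mean of the column
# (assigning the result back avoids inplace=True on a column selection,
# which recent pandas versions warn about)
df['walkDistance'] = df['walkDistance'].fillna(df['walkDistance'].mean())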
print(df.isnull().sum())

In [None]:
# Alternatively, you can drop the rows that contain nan values (axis=0 is the default)
df = df.dropna()
print(df.isnull().sum())

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## More functions

### Sorting

You can sort a DataFrame, like `ORDER BY` in SQL, with:

```
df.sort_values(YOUR_COLUMN, ascending=True)  # or ascending=False
```
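
As an illustration of the pattern, here is a small sketch on a made-up DataFrame. The `player` column and all values are invented for the example and are not part of the game dataset.

```python
import pandas as pd

# Made-up players and scores
toy = pd.DataFrame({
    "player": ["ash", "misty", "brock"],
    "winPoints": [1500, 1420, 1610],
})

# Highest winPoints first, like ORDER BY winPoints DESC in SQL
ranked = toy.sort_values("winPoints", ascending=False)
print(ranked)
```

`sort_values` returns a sorted copy and leaves `toy` in its original order, so assign the result to a variable if you want to keep it.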

Try it out!


In [None]:
# sorting dataframe according to the values of column `winPoints`

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Grouping the data according to a column

You can group rows, like `GROUP BY` in SQL, with:



```
df.groupby(YOUR_COLUMNS).SOME_OPERATION()
```
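
A short sketch of the same pattern on made-up data may help. The rows below are invented for illustration, and the example groups by `matchType` and averages `catches`, so it does not cover the counting exercise that follows.

```python
import pandas as pd

# Made-up rows: one row per player
toy = pd.DataFrame({
    "matchType": ["solo", "duo", "solo", "squad", "duo", "solo"],
    "catches": [3, 1, 0, 5, 2, 4],
})

# Like SELECT matchType, AVG(catches) ... GROUP BY matchType
mean_catches = toy.groupby("matchType")["catches"].mean()

# Several aggregations at once
summary = toy.groupby("matchType")["catches"].agg(["sum", "max"])

print(mean_catches)
print(summary)
```

The grouping column becomes the index of the result, so a single group can be looked up with, for example, `mean_catches["duo"]`.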



In [None]:
# group the data by matchId and count the rows in each group

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Plotting Graphs

In [None]:
# plotting scatter graph
df.plot(y='precisionCatch', x='winPoints', kind='scatter')

In [None]:
# You can even use Google Colab to plot for you! Try it out.
df[["walkDistance", "winPoints"]]

# Visualization with Sweetviz

Sweetviz is a reporting tool that speeds up your exploratory data analysis (EDA).

In [None]:
!pip install sweetviz

In [None]:
import sweetviz as sv

my_report = sv.analyze(df)
my_report.show_html() # Default arguments will generate to "SWEETVIZ_REPORT.html"

# Note: If the report does not appear, you have to download the html file and open it in your browser.

# The End !!

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
