# Introduction to Data Analysis with pandas


![alt text][logo1]

[logo1]: https://i.pinimg.com/236x/ca/f8/e8/caf8e8b084c55999ca3e5baf7b29c11b--panda-party-baby-shawer.jpg?b=t


![alt text][logo]

[logo]: https://pandas.pydata.org/_static/pandas_logo.png






## IEEELearn Coding Workshops



Welcome to the IEEELearn Python workshops! We hope you enjoyed all of this morning's workshops.
We will now move on to the last workshop in this bootcamp: Data Analysis with pandas.

The scope of this workshop is limited to data wrangling and the Python syntax it needs. Learning analysis itself would require algorithms like regression, for which we don't have enough time. Maybe in the next workshop series :)



To start off, we will import the pandas library.


In [0]:
import pandas as pd                    #notice the usage of import and the as keyword

### Reading data

The first step in any data analysis project is to actually have the data. Data can be stored in many formats, such as CSV files, XLSX spreadsheets, SQL databases and so on. So the fundamental operation is to read the data.

Before reading data, you need to know how data is stored when we use pandas. There are two important terms to keep in mind:

1. __Series__ - A Series is a sequence of data values, each with an index label. It is analogous to the list which you learnt about previously.
2. __DataFrame__ - A DataFrame is a table of individual entries. Each entry sits at a particular row and column. This means it is analogous to a table in Excel, or a 2-D array in general.

A DataFrame is essentially a collection of Series, each of which is one column.

We will now create a Series and a DataFrame, and then read data from a CSV file.

In [0]:
pd.Series([1, 2, 3, 4, 5]) #this creates a column containing the values 1,2,3,4,5

In [0]:
pd.Series([50, 60, 70], index=['2017 Sales', '2018 Sales', '2019 Sales'], name='Product A sales report')

Now that we are comfortable with Series, we will move on to DataFrames

In [0]:
pd.DataFrame({'Yes': [50, 20], 'No': [30, 90]})

In [0]:
pd.DataFrame({'Bob': ['Meh', 'Duh!'], 'Sue': ['Pffftt..', 'Damnn']})

In [0]:
pd.DataFrame({'Bob': ['Meh', 'Duh!'], 'Sue': ['Pffftt..', 'Damnn']}, index=['Line 1', 'Line 2'])
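
Since a DataFrame is a collection of Series, selecting a single column should give back a Series. The short sketch below checks this on a small made-up table (the row labels `Product A` and `Product B` are invented for the illustration).

```python
import pandas as pd

# Toy data, made up for this illustration.
df = pd.DataFrame({'Yes': [50, 20], 'No': [30, 90]},
                  index=['Product A', 'Product B'])

yes_column = df['Yes']                      # selecting one column returns a Series
print(type(yes_column).__name__)            # the type of the selected column
print(yes_column.name)                      # the column name becomes the Series name
print(yes_column.index.equals(df.index))    # the Series shares the DataFrame's row index
```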

#### Reading CSV files

Being able to create a DataFrame and Series by hand is handy. But most of the time we won't be creating our own data by hand; we'll be working with data that already exists.

Data can be stored in any of a number of different forms and formats. 

By far the most basic of these is the humble CSV file. CSV stands for Comma Separated Values. When you open a CSV file you see plain text: each line is one row of a table, and the values in that row are separated by commas.
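
To make the format concrete, here is a minimal sketch with a tiny made-up CSV held in a string. The column names and values are invented for the illustration; `io.StringIO` is only used so that `read_csv` can treat the string like a file.

```python
import io
import pandas as pd

# A tiny, made-up CSV: the first line holds the column names,
# and every following line is one row.
csv_text = """country,points,price
Portugal,87,15.0
US,90,35.0
"""

# io.StringIO lets read_csv treat the string as if it were a file.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
print(sample.shape)
```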
The following syntax is used to read a CSV file stored on the local machine (offline):

In [0]:
# The path below is relative to the notebook's working directory, so I can refer to the file directly.
# For files in some other directory, it is better to use the full path.
wine_reviews = pd.read_csv("content/winemag-data_first150k.csv")

wine_reviews

In case you want to use a CSV file which is online, it can be done in the following way:

In [0]:
url = "https://raw.githubusercontent.com/stoltzmaniac/wine-reviews-kaggle/master/winemag-data_first150k.csv"

wine_reviews = pd.read_csv(url, index_col=0)   # index_col=0 uses the first column of the file as the row index

wine_reviews

We will now go through some of the common operations that can be performed on a DataFrame after it is loaded.

Please refer to this cheatsheet for a quick reference at any time.
[Pandas Cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf)


In [0]:
wine_reviews.shape     # shape returns the size of the dataframe (rows, columns)

In [0]:
wine_reviews.head()     # returns the first five rows of the dataset by default


## Indexing and Selection of Data 

If you look at a pandas `Series`, it behaves much like a fancy `dict`, which was covered in the advanced workshop.

We will therefore exploit this feature to perform various operations. 



In [0]:
wine_reviews.country # both the statements are equivalent, so we can use it as a dict

wine_reviews['country']

In [0]:
wine_reviews['country'][1]

Yes, this is very useful, but pandas provides two more powerful indexers which help us with indexing and slicing.

1. `loc` - label based selection
2. `iloc` - integer position based selection

Let us have a look at `iloc` first. As mentioned, `iloc` is position based selection. This means we access the data by its numerical position, counting from 0.


In [0]:
wine_reviews.iloc[10]

Both `iloc` and `loc` are __row first, column second__

In [0]:
wine_reviews.iloc[:,8]  #this will give me the variety of all the wines in the dataset

In [0]:
req_list = [1,5,7,9]

wine_reviews.iloc[req_list]

#wine_reviews.iloc[req_list,1]



We will now have a look at the `loc` indexer.
`loc` is a label based selection operator. It uses the index labels of the data and not numerical positions.
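
The difference between the two is easiest to see when the index labels do not match the row positions. The sketch below uses a small made-up table with the labels 10, 20, 30 and 40 (invented for the illustration). Note that slices behave differently too: `iloc` excludes the end of the slice, while `loc` includes it.

```python
import pandas as pd

# Made-up data with index labels that do not match the row positions.
wines = pd.DataFrame(
    {'country': ['Italy', 'Portugal', 'US', 'France'],
     'points': [87, 88, 90, 92]},
    index=[10, 20, 30, 40],
)

print(wines.iloc[1])              # second row by position (its label is 20)
print(wines.loc[20])              # the same row, selected by its label
print(wines.iloc[0:2])            # positions 0 and 1; the end is excluded
print(wines.loc[10:30])           # labels 10, 20 and 30; the end is included
print(wines.loc[20, 'country'])   # row label first, column label second
print(wines.iloc[1, 0])           # row position first, column position second
```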

In [0]:
wine_reviews.loc[5,'country']

In [0]:
wine_reviews.loc[:, ['country','province','variety']]

In [0]:
wine_reviews.set_index("country")   # returns a new DataFrame indexed by country; wine_reviews itself is not changed
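
Once a column has been made the index, `loc` can select rows by those labels. A minimal sketch on made-up data (the table and the name `by_country` are invented for the illustration):

```python
import pandas as pd

# Made-up data for illustration.
wines = pd.DataFrame({
    'country': ['Italy', 'Portugal', 'Portugal'],
    'points': [87, 88, 90],
})

by_country = wines.set_index('country')   # returns a new DataFrame

print(wines.columns.tolist())       # the original still has a 'country' column
print(by_country.loc['Portugal'])   # all rows whose index label is 'Portugal'
```

To keep the new index, assign the result to a name as done here.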

### Conditional selection

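A condition on a column produces a Series of True/False booleans. We can pass this Series to `loc` to select only the rows where the condition is True.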

In [0]:
wine_reviews.loc[wine_reviews.country == "Portugal"]

In [0]:
wine_reviews.loc[(wine_reviews.country == "Portugal") & (wine_reviews.points >=80)]
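
It can help to look at the boolean Series on its own before using it for selection. The sketch below does this on a small made-up table and also shows `|` ("or") next to `&` ("and"); the data and variable names are invented for the illustration.

```python
import pandas as pd

# Made-up data for illustration.
wines = pd.DataFrame({
    'country': ['Italy', 'Portugal', 'Portugal', 'US'],
    'points': [87, 78, 90, 92],
})

is_portugal = wines.country == 'Portugal'   # boolean Series, one value per row
print(is_portugal.tolist())

# & means "and", | means "or". Each condition needs its own parentheses
# because & and | bind more tightly than == and >=.
good_portugal = wines.loc[is_portugal & (wines.points >= 80)]
portugal_or_top = wines.loc[is_portugal | (wines.points >= 92)]
print(good_portugal)
print(portugal_or_top)
```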


## Summarization functions

We will now look at a few of the important summarization functions. A quick overall summary is given by the `describe` function.

We will now obtain the summary of the column `points`.

In [0]:
wine_reviews.points.describe()

As you can see, the `describe` function works for numerical data. What about string data?

In [0]:
wine_reviews.country.describe()
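
The two summaries report different statistics. A small sketch on made-up data that prints only the names of the statistics, so the difference is easy to compare:

```python
import pandas as pd

# Made-up data for illustration.
wines = pd.DataFrame({
    'country': ['Italy', 'Portugal', 'Portugal', 'US'],
    'points': [87, 88, 90, 92],
})

numeric_summary = wines.points.describe()
text_summary = wines.country.describe()

print(numeric_summary.index.tolist())   # count, mean, std, min, quartiles, max
print(text_summary.index.tolist())      # count, unique, top, freq
print(text_summary['top'], text_summary['freq'])   # most frequent value and how often it occurs
```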

To get the average points, we use the `mean` function.

To view the unique values in a column, we use the `unique` function.

To view the frequency of each unique value, we use the `value_counts` function.

In [0]:
wine_reviews.points.mean()

In [0]:
wine_reviews.country.unique()

In [0]:
wine_reviews.country.value_counts()

Grouping can be used for many purposes. You've already used the `value_counts` function. We can reproduce the counts that `value_counts` gives using `groupby` by doing the following:



In [0]:
wine_reviews.groupby('country').country.count()
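
The two results hold the same counts but in a different order: `value_counts` sorts by count (largest first), while `groupby` sorts by the group key. A quick check on made-up data:

```python
import pandas as pd

# Made-up data for illustration.
wines = pd.DataFrame({'country': ['US', 'Italy', 'US', 'Portugal', 'US', 'Italy']})

by_value_counts = wines.country.value_counts()
by_groupby = wines.groupby('country').country.count()

print(by_value_counts)   # ordered by count, largest first
print(by_groupby)        # ordered alphabetically by country

# After sorting both by country, the counts are identical.
same = by_value_counts.sort_index().tolist() == by_groupby.sort_index().tolist()
print(same)
```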

There are a few __aggregation__ functions we can use with `groupby` to perform operations.
A few of them are:
+ `max`
+ `min`


In [0]:
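max_points = wine_reviews.groupby('points').price.max()   # highest price among the wines with each points score
min_points = wine_reviews.groupby('points').price.min()   # lowest price among the wines with each points score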

max_points
min_points

Another groupby method worth mentioning is `agg`, which lets you run a bunch of different functions on your DataFrame simultaneously. For example, we can generate a simple statistical summary of the dataset as follows:



In [0]:
wine_reviews.groupby(['country']).price.agg([len, 'min', 'max'])   # the strings 'min' and 'max' select pandas' own aggregations

We will now look at something called a `multi-index`, which is an index with multiple levels.

In [0]:
countries_reviewed = wine_reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed


To convert the multi index to the normal index, we will use the `reset_index` function. 

In [0]:
countries_reviewed.reset_index()
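
The sketch below repeats these steps on a small made-up table, so that the two index levels and the effect of `reset_index` are easy to see. The data and the names `reviewed` and `flat` are invented for the illustration.

```python
import pandas as pd

# Made-up data for illustration.
wines = pd.DataFrame({
    'country': ['Italy', 'Italy', 'US', 'US', 'US'],
    'province': ['Tuscany', 'Tuscany', 'California', 'Oregon', 'California'],
    'description': ['a', 'b', 'c', 'd', 'e'],
})

reviewed = wines.groupby(['country', 'province']).description.agg([len])

print(reviewed.index.nlevels)                      # number of index levels
print(reviewed.loc[('US', 'California'), 'len'])   # select a row with a (country, province) tuple

flat = reviewed.reset_index()                      # the index levels become ordinary columns
print(flat.columns.tolist())
```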

## Sorting

You can sort the data by using the `sort_values` method

In [0]:
countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len')

In [0]:
countries_reviewed.sort_values(by='country', ascending=True)

In [0]:
countries_reviewed.sort_index()

In [0]:
countries_reviewed.sort_values(by=['country','len'])
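
Sorting is ascending by default. As a small sketch on made-up data, `ascending=False` reverses the order, and a list of booleans sets the direction for each column when sorting by several columns:

```python
import pandas as pd

# Made-up data for illustration.
counts = pd.DataFrame({
    'country': ['US', 'Italy', 'US', 'Italy'],
    'province': ['Oregon', 'Tuscany', 'California', 'Veneto'],
    'len': [5, 7, 9, 3],
})

# Largest counts first.
largest_first = counts.sort_values(by='len', ascending=False)
print(largest_first)

# Countries A to Z, and within each country the largest count first.
ordered = counts.sort_values(by=['country', 'len'], ascending=[True, False])
print(ordered)
```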

Finding the datatype of a variable is very easy. You can also find the datatypes of all the features at once.

In [0]:
wine_reviews.price.dtype

In [0]:
wine_reviews.dtypes

To explicitly convert a variable from one type to another, we use the `astype` function.

In [0]:
wine_reviews.points.astype('float64')

Imagine we want to replace the name US with United States of America. We will now use the `replace` function.

In [0]:
wine_reviews.country.replace('US','United States of America')
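
Both `astype` and `replace` return a new Series and leave the original data untouched. To keep the change, assign the result back. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up data for illustration.
wines = pd.DataFrame({
    'country': ['US', 'Italy', 'US'],
    'points': [87, 88, 90],
})

converted = wines.points.astype('float64')
print(wines.points.dtype)   # the original column still holds integers
print(converted.dtype)      # the converted copy holds floats

# Assign the result back to keep the change.
wines['country'] = wines.country.replace('US', 'United States of America')
print(wines.country.unique())
```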

# Plotting of Data

## Univariate plotting with pandas

In [0]:
wine_reviews['country'].value_counts().head(10).plot.bar()


Let us now plot the number of reviews for each points score.





In [0]:
wine_reviews['points'].value_counts().sort_index().plot.bar()

In [0]:
wine_reviews['points'].value_counts().sort_index().plot.line()

In [0]:
wine_reviews['points'].value_counts().sort_index().plot.area()

The charts above work well for discrete variables. What if I want to plot an interval variable (e.g. 10-20)? We would have to plot a histogram or a similar graph.

In [0]:
wine_reviews['points'].plot.hist()
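
A histogram splits the range of values into intervals (bins) and counts how many values fall into each one. By default `plot.hist` uses 10 bins. The sketch below computes such counts by hand with `pd.cut`, which is not used elsewhere in this workshop and is brought in here only to show the idea. The data and bin edges are made up.

```python
import pandas as pd

# Made-up points scores for illustration.
points = pd.Series([80, 84, 86, 88, 89, 90, 95])

# Hypothetical bin edges. right=False makes each interval include its left edge,
# so the first bin holds values from 80 up to, but not including, 85.
bins = [80, 85, 90, 95, 100]
binned = pd.cut(points, bins=bins, right=False)

# Count the values in each bin, keeping the bins in order.
counts = binned.value_counts().sort_index()
print(counts)
```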

## Bivariate plotting with pandas


<table>
<tr>
<td><img src="https://i.imgur.com/bBj1G1v.png" width="350px"/></td>
<td><img src="https://i.imgur.com/ChK9zR3.png" width="350px"/></td>
</tr>
<tr>
<td style="font-weight:bold; font-size:16px;">Scatter Plot</td>
<td style="font-weight:bold; font-size:16px;">Hex Plot</td>
</tr>
<tr>
<td>df.plot.scatter()</td>
<td>df.plot.hexbin()</td>
</tr>
<tr>
<td>Good for interval and some nominal categorical data.</td>
<td>Good for interval and some nominal categorical data.</td>
</tr>
</table>

In [0]:
wine_reviews[wine_reviews['price'] < 100].sample(100).plot.scatter(x='price', y='points')

A **hex plot** aggregates points in space into hexagons, and then colors those hexagons based on the values within them:

In [0]:
wine_reviews[wine_reviews['price'] < 100].plot.hexbin(x='price', y='points', gridsize=15)
