## 1. Demonstration of main Pandas methods 

**[Pandas](http://pandas.pydata.org)** is a Python library that provides extensive means for data analysis. Data scientists often work with data stored in table formats like `.csv`, `.tsv`, or `.xlsx`. Pandas makes it very convenient to load, process, and analyze such tabular data using SQL-like queries. In conjunction with `Matplotlib` and `Seaborn`, `Pandas` provides a wide range of opportunities for visual analysis of tabular data.

The main data structures in `Pandas` are implemented with **Series** and **DataFrame** classes. The former is a one-dimensional indexed array of some fixed data type. The latter is a two-dimensional data structure - a table - where each column contains data of the same type. You can see it as a dictionary of `Series` instances. `DataFrames` are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.
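The relationship between the two classes can be sketched on a few made-up values. The names `ages` and `clients` and all of the data below are hypothetical and are not part of the dataset analyzed later.

```python
import pandas as pd

# A Series: one-dimensional, indexed, with a single dtype.
ages = pd.Series([34, 27, 45], index=["ann", "bob", "cid"], name="age")

# A DataFrame built from a dictionary of Series that share an index.
# Column names and values are invented for illustration only.
clients = pd.DataFrame({
    "age": ages,
    "city": pd.Series(["Oslo", "Lima", "Kyiv"], index=["ann", "bob", "cid"]),
})

print(ages.dtype)             # dtype of the Series
print(clients.dtypes)         # one dtype per column
print(type(clients["age"]))   # each column is itself a Series
```

Rows here play the role of instances and columns the role of features, as described above.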

Importing numpy and pandas and setting options for display precision:

In [1]:
import numpy as np
import pandas as pd

pd.set_option("display.precision", 2)
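The `display.precision` option affects only how floats are rendered, not the stored values. A minimal sketch, assuming a hypothetical one-column frame `toy` and using `pd.option_context` so that the setting is changed only temporarily:

```python
import pandas as pd

# Hypothetical values, chosen only to have many decimal places.
toy = pd.DataFrame({"x": [3.14159265, 2.71828183]})

# option_context changes the option only inside the with-block.
with pd.option_context("display.precision", 2):
    short = repr(toy)

with pd.option_context("display.precision", 6):
    long = repr(toy)

print(short)
print(long)
print(toy["x"].iloc[0])  # the stored value is unchanged
```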

We'll demonstrate the main methods in action by analyzing a [dataset](https://bigml.com/user/francisco/gallery/dataset/5163ad540c0b5e5b22000383) on the churn rate of telecom operator clients. Let's read the data (using `read_csv`), and take a look at the first 5 lines using the `head` method:

In [2]:
# Adjust the path if telecom_churn.csv is not in the current working directory
df = pd.read_csv('telecom_churn.csv')

In [3]:
type(df)

In [None]:
df.head(5)

Displaying whole dataframe:

In [None]:
df

<details>
<summary>About printing DataFrames in Jupyter notebooks</summary>
<p>
In Jupyter notebooks, Pandas DataFrames are rendered as formatted tables like the ones above, whereas `print(df.head())` produces plain-text output.
By default, Pandas displays at most 20 columns and 60 rows, so if your DataFrame is bigger, raise these limits with the `set_option` function as shown in the example below:

```python
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
```
</p>
</details>
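If the display limits should be raised only for a single cell, `pd.option_context` can be used instead of `set_option`. The following is a minimal sketch; the variable names `before`, `inside`, and `after` are illustrative only.

```python
import pandas as pd

before = pd.get_option("display.max_rows")

# The options are changed only inside the with-block and restored afterwards.
with pd.option_context("display.max_rows", 100, "display.max_columns", 100):
    inside = pd.get_option("display.max_rows")

after = pd.get_option("display.max_rows")
print(before, inside, after)
```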

Recall that each row corresponds to one client, an **instance**, and columns are **features** of this instance.

Let’s have a look at data dimensionality, feature names, and feature types. First, a single column can be selected either with bracket notation or as an attribute:

In [None]:
df["State"]

In [None]:
df.State

Selecting a contiguous range of columns by position (here, the first two):

In [None]:
df[df.columns[0:2]]

Selecting several specific columns by name:

In [None]:
df.loc[:, ["State", "Area code"]]
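The selection styles above can be compared side by side on a small stand-in frame. The frame `toy` and its rows are hypothetical; only the column names `State`, `Area code`, and `Churn` are taken from the text. Note that attribute access works only when the column name is a valid Python identifier, so a name like `Area code` requires brackets or `loc`.

```python
import pandas as pd

# Hypothetical toy rows that reuse three column names from the text.
toy = pd.DataFrame({
    "State": ["KS", "OH", "NJ"],
    "Area code": [415, 415, 408],
    "Churn": [False, False, True],
})

by_bracket = toy["State"]            # works for any column name
by_attribute = toy.State             # only for names that are valid identifiers
first_two = toy[toy.columns[0:2]]    # positional slice of the column index
same_by_iloc = toy.iloc[:, 0:2]      # the same slice via integer-location indexing
by_name = toy.loc[:, ["State", "Churn"]]  # explicit list of labels

print(by_bracket.equals(by_attribute))
print(first_two.equals(same_by_iloc))
print(list(by_name.columns))
```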

Print the Python type of a column object and the data type (`dtype`) of the values it holds:

In [None]:
type(df.State) #OR type(df["State"])

In [None]:
print(df["State"].dtype)

Print the dimensions/shape of the dataframe:

In [None]:
df.shape

From the output, we can see that the table contains 3333 rows and 20 columns.

Now let's try printing out column names using `columns`:

In [None]:
df.columns

Detailed info of the dataframe:

In [None]:
df.info()

`bool`, `int64`, `float64` and `object` are the data types of our features. We see that one feature is logical (`bool`), 3 features are of type `object`, and 16 features are numeric. With this same method, we can easily see if there are any missing values. Here, there are none because each column contains 3333 observations, the same number of rows we saw before with `shape`.
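The same check for missing values can be made explicit with `count` and `isna`. This is a minimal sketch on a hypothetical frame `toy` that, unlike the dataset described above, deliberately contains gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with two deliberately missing entries.
toy = pd.DataFrame({
    "State": ["KS", "OH", None],
    "Total day minutes": [265.1, np.nan, 243.4],
    "Churn": [False, False, True],
})

n_rows = toy.shape[0]
non_null = toy.count()       # non-missing values per column, as reported by info()
missing = toy.isna().sum()   # missing values per column

print(n_rows)
print(non_null)
print(missing)
```

A column has no missing values exactly when its non-null count equals the number of rows.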

We can **change the column type** with the `astype` method. Let's apply this method to the `Churn` feature to convert it into `int64`:

In [None]:
df['Churn'] = df['Churn'].astype('int64')
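What the conversion from `bool` to `int64` does to the values can be seen on a short, made-up Series (the name `churn` and its contents are hypothetical):

```python
import pandas as pd

# Hypothetical boolean flags standing in for the Churn column.
churn = pd.Series([False, True, False, False], name="Churn")

as_int = churn.astype("int64")   # False -> 0, True -> 1

print(as_int.dtype)
print(as_int.tolist())
print(as_int.mean())             # the mean of a 0/1 column is the share of ones
```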

The `describe` method shows basic statistical characteristics of each numerical feature (`int64` and `float64` types): number of non-missing values, mean, standard deviation, minimum and maximum, median, and the 0.25 and 0.75 quantiles.

In [None]:
df.describe()

In order to see statistics on non-numerical features, one has to explicitly indicate data types of interest in the `include` parameter.

In [None]:
df.describe(include=['object', 'bool'])
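For non-numerical columns, `describe` reports the count, the number of unique values, the most frequent value (`top`), and its frequency (`freq`). A minimal sketch on a hypothetical frame `toy`; the `State` column is created with an explicit `object` dtype as an assumption, so that the `include` filter matches regardless of how a given Pandas version infers string columns.

```python
import pandas as pd

# Hypothetical rows; dtype=object is set explicitly for the string column.
toy = pd.DataFrame({
    "State": pd.Series(["KS", "OH", "KS"], dtype="object"),
    "Churn": [False, True, False],
    "Total day minutes": [265.1, 161.6, 243.4],
})

summary = toy.describe(include=["object", "bool"])
print(summary)   # the numeric column is left out
```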

For categorical (type `object`) and boolean (type `bool`) features we can use the `value_counts` method. Let's have a look at the distribution of `Churn`:

In [None]:
df['Churn'].value_counts()

2850 users out of 3333 are *loyal*; their `Churn` value is `0`. To calculate fractions instead of counts, pass `normalize=True` to the `value_counts` method.

In [None]:
df['Churn'].value_counts(normalize=True)
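The effect of `normalize=True` can be sketched on a short, made-up Series (the values in `churn` are hypothetical). The last line recomputes the loyal share from the counts quoted above, 2850 out of 3333.

```python
import pandas as pd

# Hypothetical 0/1 labels: three loyal clients and one who churned.
churn = pd.Series([0, 0, 0, 1], name="Churn")

counts = churn.value_counts()                 # absolute frequencies
shares = churn.value_counts(normalize=True)   # frequencies divided by the total

print(counts)
print(shares)

loyal_share = round(2850 / 3333, 3)           # share of loyal users from the text
print(loyal_share)
```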

The distribution of values of almost any numerical feature can be quickly displayed as a histogram:

In [None]:
df['Total day minutes'].hist()
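By default, `hist` splits the value range into 10 equal-width bins. The numbers behind such a plot can be sketched without plotting by calling `np.histogram` on the same kind of data; the values in `minutes` are hypothetical and serve only as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical call durations standing in for the real column.
minutes = pd.Series(
    [110.0, 150.5, 180.2, 181.0, 199.9, 240.3, 265.1],
    name="Total day minutes",
)

# 10 equal-width bins between the minimum and the maximum value.
counts, edges = np.histogram(minutes, bins=10)

print(counts)   # number of values falling into each bin
print(edges)    # 11 bin edges delimiting the 10 bins
```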