In [1]:
import pandas as pd

In [7]:
df = pd.read_csv("../datasets/avocado_kaggle.csv")
df.sample()

Unnamed: 0.1,Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
7932,37,2017-04-16,1.09,682386.6,131986.76,151806.57,404.18,398189.09,397484.51,217.5,487.08,conventional,2017,Seattle


In [5]:
df["AveragePrice"] = df["AveragePrice"].apply(lambda x: int(x))

In [9]:
df["AveragePrice"] = df["AveragePrice"].astype("int64")
df["AveragePrice"]

0        1
1        1
2        0
3        1
4        1
        ..
18244    1
18245    1
18246    1
18247    1
18248    1
Name: AveragePrice, Length: 18249, dtype: int64

In [6]:
df.dtypes

Unnamed: 0        int64
Date             object
AveragePrice      int64
Total Volume    float64
4046            float64
4225            float64
4770            float64
Total Bags      float64
Small Bags      float64
Large Bags      float64
XLarge Bags     float64
type             object
year              int64
region           object
dtype: object

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Installation" data-toc-modified-id="Installation-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Installation</a></span></li><li><span><a href="#Introduction-to-pandas-data-structures" data-toc-modified-id="Introduction-to-pandas-data-structures-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Introduction to pandas data structures</a></span><ul class="toc-item"><li><span><a href="#Series" data-toc-modified-id="Series-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Series</a></span></li><li><span><a href="#Dataframes" data-toc-modified-id="Dataframes-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Dataframes</a></span><ul class="toc-item"><li><span><a href="#From-data-types" data-toc-modified-id="From-data-types-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>From data types</a></span></li><li><span><a href="#From-path" data-toc-modified-id="From-path-3.2.2"><span class="toc-item-num">3.2.2&nbsp;&nbsp;</span>From path</a></span></li><li><span><a href="#From-databases" data-toc-modified-id="From-databases-3.2.3"><span class="toc-item-num">3.2.3&nbsp;&nbsp;</span>From databases</a></span></li></ul></li></ul></li><li><span><a href="#Exploratory-analysis-of-a-dataframe" data-toc-modified-id="Exploratory-analysis-of-a-dataframe-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Exploratory analysis of a dataframe</a></span><ul class="toc-item"><li><span><a href="#Meta-information" data-toc-modified-id="Meta-information-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Meta information</a></span></li><li><span><a href="#Previsualization" data-toc-modified-id="Previsualization-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Previsualization</a></span></li><li><span><a href="#Order-a-dataframe" 
data-toc-modified-id="Order-a-dataframe-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Order a dataframe</a></span></li><li><span><a href="#NaN-values" data-toc-modified-id="NaN-values-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>NaN values</a></span></li><li><span><a href="#Basic-descriptive-statistics" data-toc-modified-id="Basic-descriptive-statistics-4.5"><span class="toc-item-num">4.5&nbsp;&nbsp;</span>Basic descriptive statistics</a></span></li></ul></li><li><span><a href="#Pandas-usual-methods" data-toc-modified-id="Pandas-usual-methods-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Pandas usual methods</a></span></li><li><span><a href="#Further-materials" data-toc-modified-id="Further-materials-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Further materials</a></span></li></ul></div>

# Pandas

![pandas](https://media.giphy.com/media/nVsLCrW5iHf6E/giphy.gif)

## Introduction
Pandas is one of the most widely used libraries in the Python ecosystem for data manipulation and analysis. It is fast, powerful, flexible, easy to use, and open source.


Among its main features:

- A fast and efficient **DataFrame** object for data manipulation with built-in indexing

- **Reading and writing** of data in many formats: Microsoft Excel, CSV, SQL databases, etc.

- Integrated and efficient methods for all types of data manipulation: missing data, subset, union, merge, etc.

- Ease of working with time series data (in fact, the name pandas derives from "panel data")

- Good **integration with other data analysis or machine learning libraries**: scikit-learn, scipy, seaborn, plotly, etc.

- It is **widely used** in both the private and academic sectors


Pandas provides high-level data structures and functions designed to make working with structured or tabular data fast, easy, and expressive. Since its introduction in 2010, it has helped make Python a powerful and productive data analysis environment. The main pandas objects used in this notebook are the DataFrame, a column-oriented tabular data structure with row and column labels, and the Series, a labeled one-dimensional array object.

Pandas combines the high-performance array-computing ideas of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases (such as SQL). It provides sophisticated indexing functionality that makes it easy to reshape, slice and dice, perform aggregations, and select subsets of data.

![image](https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fgilpress%2Ffiles%2F2016%2F03%2FTime-1200x511.jpg)



Source: [Forbes](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#1ba071616f63)

## Installation

The first step is always to install pandas, with either `pip install pandas` or `conda install pandas`.

In [1]:
# Anaconda
    # Environments: ironhack
    # install pandas
    
# Miniconda
    # conda activate ironhack
    # pip install pandas

In [2]:
import pandas as pd
import numpy as np

## Introduction to pandas data structures
To get started with pandas, you'll need to get comfortable with its two working data structures: Series and DataFrame. Although they are not a universal solution to all problems, they provide a solid and easy-to-use foundation for most applications.

### Series
A Series is a one-dimensional array object containing a sequence of values (of NumPy-like types) and an associated array of data labels, called its index. The simplest Series is formed from a single array of data:
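A minimal sketch of that simplest case. The values and the name `obj` are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical sample values: any list or NumPy array would do
obj = pd.Series([4, 7, -5, 3])
print(obj)
```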

The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we didn't specify an index for the data, a default one consisting of the integers 0 to N - 1 (where N is the length of the data) is created. You can get the array representation and the index object of the Series through its `values` and `index` attributes, respectively:
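As a rough illustration of those two attributes (the data and the names `obj` and `obj2` are assumptions made for this example), and of what changes when an explicit index is supplied:

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])  # hypothetical values

print(obj.values)  # the underlying NumPy array
print(obj.index)   # the default index, running from 0 to N - 1

# An explicit index can be supplied instead of the default one
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
print(obj2["a"])   # look a value up by its label
```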

Another way to think of a Series is as a fixed-length ordered dict, since it is a mapping of index values to data values. It can be used in many contexts where a dictionary could be used.
If you have data contained in a Python dict, you can create a Series from it by passing the dict:
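A small sketch of the dict case. The name `sdata` and the 'Utah' key come from the text; the remaining keys and all the numbers are made up for illustration:

```python
import pandas as pd

# Hypothetical figures keyed by state name
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

obj3 = pd.Series(sdata)
print(obj3)

# Like a dict, membership is tested against the labels, not the values
print("Texas" in obj3)
```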

When you pass only a dict, the index of the resulting Series contains the dict's keys in order. You can override this by passing the keys in the order you want them to appear in the resulting Series:
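One way this could look, assuming a dict named `sdata` that holds 'Utah' but not 'California' and a hypothetical list `states` used as the index (all numbers are invented):

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]

# Only labels listed in `states` are kept, in that order;
# labels with no matching key in `sdata` get NaN
obj4 = pd.Series(sdata, index=states)
print(obj4)
```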

Here, the three values found in `sdata` were placed in the appropriate positions, but since no value was found for 'California', it appears as NaN (not a number), which pandas uses to mark missing or NA values. Since 'Utah' was not included in `states`, it is excluded from the resulting object.

### Dataframes
Pandas can read and write data in a wide variety of formats ([read the documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)).
A DataFrame can also be built directly from Python objects; one of the most common ways is from a dict of equal-length lists or NumPy arrays:

#### From data types

`from dictionaries with lists as values`
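A minimal sketch of this construction, with hypothetical column names and values: each key becomes a column and each list supplies that column's values.

```python
import pandas as pd

# Hypothetical data: every list must have the same length
data = {
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 2.4],
}

frame = pd.DataFrame(data)
print(frame)
```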

Since we are using Jupyter Notebook, pandas DataFrame objects will be displayed as a more browser-friendly HTML table. [More info on this](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html)

`from list of dictionaries`

When a DataFrame is created from a list of dictionaries:
- Each dictionary becomes a row
- The keys become the column names
- The dictionaries should share the same keys; a key missing from a dictionary shows up as NaN in that row
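A short sketch of the row-wise construction, using made-up records. The third dict deliberately omits a key to show what pandas does in that case (it fills the cell with NaN):

```python
import pandas as pd

# Hypothetical records, one dict per row
rows = [
    {"name": "Ana", "age": 31},
    {"name": "Luis", "age": 27},
    {"name": "Marta"},  # no "age" key: the cell becomes NaN
]

people = pd.DataFrame(rows)
print(people)
```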

[pandas from dict](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_dict.html)

#### From path

`.csv`
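As a rough sketch of `pd.read_csv`: to keep the example self-contained it reads from an in-memory string instead of a file on disk, and the second data row is invented. With a real file you would pass its path, as in the first cell of this notebook.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file (hypothetical contents)
csv_text = (
    "Date,AveragePrice,region\n"
    "2017-04-16,1.09,Seattle\n"
    "2017-04-23,1.21,Seattle\n"
)

# parse_dates converts the listed columns to datetime instead of leaving them as text
df_csv = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
print(df_csv.dtypes)
```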

`.xlsx`, `.xls`, `.xlsm`, `.xlsb`, `.odf`, `.ods`, `.odt`

In [37]:
#df_from_excel = pd.read_excel("../datasets/Online Retail.xlsx", engine="openpyxl", nrows=5)
#df_from_excel.head()

`Reading different sheets`

In [38]:
# Default tab (first one)

#df_from_excel_new_tab = pd.read_excel("../datasets/Online Retail.xlsx", engine="openpyxl", nrows=5)
#df_from_excel_new_tab.head()

In [39]:
# Other tab: (new_tab)
#df_from_excel_new_tab = pd.read_excel("../datasets/Online Retail.xlsx", engine="openpyxl", sheet_name="new_tab", nrows=5)
#df_from_excel_new_tab.head()

`web`: `read_csv` also accepts a URL in place of a local path, for example https://raw.githubusercontent.com/datapackage-examples/sample-csv/master/sample.csv

#### From databases

`sql`: [docs](https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html) 

```python
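import pandas as pd
from sqlite3 import connect

conn = connect(':memory:')  # throwaway in-memory SQLite database

# Write a small illustrative table first so that there is something to query
sample = pd.DataFrame({'column_1': [1, 2], 'column_2': ['a', 'b']})
sample.to_sql('sample_data', conn, index=False)

# Read the result of a SQL query into a DataFrame
df = pd.read_sql('SELECT column_1, column_2 FROM sample_data', conn)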
```

`mongodb`

```python
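import pandas as pd
from pymongo import MongoClient

client = MongoClient()  # connects to localhost:27017 by default
db = client.database_name  # replace database_name with your database
collection = db.collection_name  # replace collection_name with your collection
# find() returns a cursor; turn it into a list of dicts, one per document
data = pd.DataFrame(list(collection.find()))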
```

## Exploratory analysis of a dataframe

### Meta information

`shape, columns, dtypes, info, describe`

[How dtypes work](https://numpy.org/doc/stable/reference/arrays.dtypes.html)
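A brief sketch of these attributes and methods on a small hypothetical frame (the name `df_meta` and its contents are invented for the example):

```python
import pandas as pd

df_meta = pd.DataFrame({
    "region": ["Seattle", "Boston", "Denver"],
    "AveragePrice": [1.09, 1.35, 1.12],
    "year": [2017, 2017, 2018],
})

print(df_meta.shape)       # (rows, columns) tuple
print(df_meta.columns)     # column labels
print(df_meta.dtypes)      # one dtype per column
df_meta.info()             # prints dtypes, non-null counts and memory usage
print(df_meta.describe())  # summary statistics for the numeric columns
```

Note that `shape`, `columns` and `dtypes` are attributes, while `info()` and `describe()` are methods and need parentheses.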

### Previsualization

`head`

By default, `head` shows the first 5 rows. To see more or fewer, pass the number of rows as an argument.

`tail`
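A quick sketch of both methods on a hypothetical ten-row frame:

```python
import pandas as pd

df_rows = pd.DataFrame({"n": range(10)})  # hypothetical data: 0..9

print(df_rows.head())    # first 5 rows
print(df_rows.head(3))   # first 3 rows
print(df_rows.tail(2))   # last 2 rows
```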

### Order a dataframe

The same operation, but returning only specific columns
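The original cells for this section are not shown, so the following is only a sketch of what ordering might look like, assuming `sort_values` is the operation meant. The frame `df_sort` and its contents are hypothetical:

```python
import pandas as pd

df_sort = pd.DataFrame({
    "region": ["Seattle", "Boston", "Denver"],
    "AveragePrice": [1.09, 1.35, 1.12],
    "year": [2017, 2017, 2018],
})

# Sort by one column, highest price first
by_price = df_sort.sort_values(by="AveragePrice", ascending=False)
print(by_price)

# Same sort, keeping only some of the columns
subset = df_sort.sort_values(by="AveragePrice", ascending=False)[["region", "AveragePrice"]]
print(subset)
```

`sort_values` returns a new DataFrame and keeps the original row labels, so the index of the result is no longer in ascending order.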

`sample`
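A minimal sketch of `sample` on a hypothetical frame. `random_state` is used here only to make the draw reproducible:

```python
import pandas as pd

df_s = pd.DataFrame({"n": range(100)})  # hypothetical data

one_row = df_s.sample()                        # a single random row
ten_rows = df_s.sample(n=10, random_state=0)   # 10 random rows, reproducible
tenth = df_s.sample(frac=0.1, random_state=0)  # 10% of the rows
```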

`display`

### NaN values
NaN stands for Not a Number and is one of the most common ways to represent missing values in data. It is a special floating-point value, so it cannot be stored in a plain integer column: a numeric column that contains NaN is held as float.
Missing values are one of the recurring problems in data analysis, and handling them properly is essential to get reliable results.
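A short sketch of the usual ways to detect and handle missing values, on a hypothetical frame (`df_nan` and the replacement values are assumptions for this example):

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "z"]})

print(df_nan.isna().sum())  # number of missing values per column

# Drop every row that has at least one missing value
dropped = df_nan.dropna()

# Or fill the gaps, here with a different replacement for each column
filled = df_nan.fillna({"a": 0.0, "b": "unknown"})
print(filled)
```

Whether to drop or fill depends on the analysis; both calls return a new DataFrame and leave `df_nan` untouched.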

### Basic descriptive statistics
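This section has no cells in the source, so the following is only a sketch of a few basic statistics on a hypothetical frame. The column names mirror the avocado dataset, but the values are invented:

```python
import pandas as pd

prices = pd.DataFrame({
    "AveragePrice": [1.0, 1.5, 2.0, 1.5],
    "type": ["conventional", "organic", "organic", "conventional"],
})

print(prices["AveragePrice"].mean())    # arithmetic mean
print(prices["AveragePrice"].median())  # middle value
print(prices["AveragePrice"].min(), prices["AveragePrice"].max())
print(prices["type"].value_counts())    # frequency of each category
```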

## Pandas usual methods
```python
df.head() # show the first rows (5 by default)
df.tail() # show the last rows (5 by default)
df.describe() # summary statistics
df.info() # dtypes, non-null counts and memory usage
df.columns # column labels
df.index # row labels (index)
df.dtypes # data type of each column
df.plot() # make a plot
df.hist() # histogram of each numeric column
df.col.value_counts() # count occurrences of each unique value in a column
df.col.unique() # unique values of a column
df.copy() # copy the df
df.drop() # remove rows (axis=0) or columns (axis=1)
df.dropna() # remove rows with nulls
df.fillna() # fill nulls
df.shape # dimensions of the df (rows, columns)
df.select_dtypes(include='number') # select numeric columns
df.rename() # rename columns or index labels
df.col.str.replace() # replace text in a string column
df.astype(dtype='float32') # change the data type
df.iloc[] # select by integer position
df.loc[] # select by label
df.transpose() # transpose the df
df.T # shorthand for transpose()
df.sample(n=5) # random sample of rows (or frac=0.1 for a fraction)
df.col.sum() # sum of a column
df.col.max() # maximum of a column
df.col.min() # minimum of a column
df['col'] # select a column
df.col # same, via attribute access
df.isnull() # boolean mask of null values
df.isna() # same as isnull()
df.notna() # boolean mask of non-null values
df.drop_duplicates() # remove duplicate rows
df.reset_index(inplace=True) # reset the index in place
```

## Further materials

* [Read the docs!](https://pandas.pydata.org/pandas-docs/stable/index.html)
* [Cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
* [Exercises to practice](https://github.com/guipsamora/pandas_exercises)
* [More on merge, concat, and join](https://realpython.com/pandas-merge-join-and-concat/#pandas-join-combining-data-on-a-column-or-index). And [even more!](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)
 