In [2]:
import numpy as np

# Introduction to pandas (part I)


![image.png](attachment:image.png)

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Pandas-data-structures" data-toc-modified-id="Pandas-data-structures-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Pandas data structures</a></span><ul class="toc-item"><li><span><a href="#DataFrame" data-toc-modified-id="DataFrame-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>DataFrame</a></span></li><li><span><a href="#Series" data-toc-modified-id="Series-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Series</a></span></li></ul></li><li><span><a href="#Read/Write-data" data-toc-modified-id="Read/Write-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Read/Write data</a></span></li><li><span><a href="#Basic-DataFrame-operations" data-toc-modified-id="Basic-DataFrame-operations-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Basic DataFrame operations</a></span></li><li><span><a href="#Basic-Plots-&amp;-Statistics" data-toc-modified-id="Basic-Plots-&amp;-Statistics-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Basic Plots &amp; Statistics</a></span></li><li><span><a href="#Summary" data-toc-modified-id="Summary-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Summary</a></span></li><li><span><a href="#Further-materials" data-toc-modified-id="Further-materials-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Further materials</a></span></li></ul></div>

## Introduction

pandas is one of the most widely used libraries in the Python ecosystem for data analysis and manipulation. It is fast, powerful, flexible, easy to use, and open source!


Among its main features are:
* A fast and efficient **DataFrame** object for data manipulation with integrated indexing;
* **Read and write** data in a multitude of formats: Microsoft Excel, CSV, SQL databases, etc.;
* Integrated and efficient methods for all kinds of data manipulation: missing data, subsetting, join, merge, etc.;
* Facilities for working with time series data (in fact, the name pandas derives from "PANel DAta");
* Good **integration with other libraries** for data analysis or machine learning: scikit-learn, scipy, seaborn, plotly, etc.;
* It is **widely used** in both the private and academic sectors.


![image.png](attachment:image.png)

Source: [Forbes](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#1ba071616f63)

## Pandas data structures

In [1]:
# let's unleash the magic!!
import pandas as pd

### DataFrame 

DataFrame (notice the naming, it is an [Object](https://github.com/pandas-dev/pandas/blob/v1.1.3/pandas/core/frame.py#L340-L9264)!) is probably the most widely used pandas object. It is a 2-dimensional labeled data structure whose columns can have different types. You can think of it as a spreadsheet or a SQL table.

![image.png](attachment:image.png)

### Series

These are the named columns of a DataFrame (more precisely, a DataFrame behaves like a dictionary of Series). The entries of a Series should have a homogeneous type.
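The relationship between the two structures can be checked directly. The snippet below is a minimal sketch on a toy DataFrame; `df_toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical values) to illustrate the DataFrame/Series relationship
df_toy = pd.DataFrame({"age": [24, 53, 23], "gender": ["M", "F", "M"]})

age = df_toy["age"]          # selecting one column returns a Series
print(type(age).__name__)    # Series
print(age.name)              # the Series keeps the column label as its name
print(df_toy.dtypes)         # each column (Series) has its own dtype
```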

## Read/Write data

pandas can read and write data in a large variety of formats. [Read the docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)

**From a dict...**

...or lists of lists, or other Python data structures. [Read the docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)

In [22]:
variable_a = np.random.randint(7, 9, 20)
variable_b = np.random.randint(6, 8, 20)

data = {'a': variable_a,
        'b': variable_b}
print(data)

{'a': array([7, 8, 8, 8, 7, 8, 7, 8, 7, 8, 7, 7, 8, 8, 7, 8, 7, 7, 7, 8]), 'b': array([6, 7, 7, 7, 6, 6, 6, 7, 6, 7, 7, 7, 6, 7, 7, 7, 6, 7, 6, 7])}


In [84]:
df_random = pd.DataFrame(data)
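A list of lists works in much the same way as a dict. This is a small sketch with made-up values; the name `df_rows` is hypothetical, and the column labels are passed separately because a plain list carries none.

```python
import pandas as pd

# Each inner list is one row; column labels are passed explicitly
rows = [[7, 6], [8, 7], [8, 7]]
df_rows = pd.DataFrame(rows, columns=["a", "b"])
print(df_rows.shape)  # (3, 2)
```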

**From a local file**

In [29]:
! ls ../datasets

AppleStore.csv     vehicles_messy.csv


In [17]:
# pd.read_csv('../datasets/vehicles_messy.csv')

**From url**

This is the [link](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user) to the data

In [60]:
df_from_url =  pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', 
                           sep='|', 
                           index_col='user_id')

And anything that can be read can also be saved:

In [26]:
df_from_url.to_csv('../datasets/my_downloaded_dataset.csv')

**Question:** How can I delete the file I just created right from the notebook?
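One possible answer, sketched with the standard library only. The temporary directory and file below are stand-ins (assumptions) so that the snippet does not touch `../datasets`; in the notebook the path would be the CSV written above. A shell command such as `! rm` from a cell is another option.

```python
from pathlib import Path
import tempfile

# Hypothetical stand-in file, created only so there is something to delete
tmp_dir = Path(tempfile.mkdtemp())
target = tmp_dir / "my_downloaded_dataset.csv"
target.write_text("user_id,age\n1,24\n")

target.unlink()            # delete the file
print(target.exists())     # False
```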

Creating Series objects is just as easy:

In [31]:
my_variable = np.random.randint(7, 9, 20)
my_variable

array([8, 7, 8, 7, 7, 7, 7, 7, 7, 8, 8, 8, 7, 7, 7, 7, 7, 8, 7, 7])

**Exercise** ([Read the docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html))

* How could we add a name to the Series?
* How could we change the datatype to float?
* How could we access the name of the Series we just created?

In [47]:
# Remember DataFrames and Series are Objects ;)
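A possible solution to the exercise, as a sketch. The name `"grades"` and the variable names are hypothetical; the calls used are the `name` argument of `pd.Series`, `Series.astype`, and the `name` attribute.

```python
import numpy as np
import pandas as pd

values = np.random.randint(7, 9, 20)

# Name the Series at creation time ("grades" is a made-up name)
grades = pd.Series(values, name="grades")

# astype returns a new Series with the requested dtype
grades_float = grades.astype(float)

print(grades_float.name)    # access the name through the attribute
print(grades_float.dtype)
```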

## Basic DataFrame operations

In [65]:
df = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user',
                 sep='|', 
                 index_col='user_id')

In [74]:
# show the first rows

In [75]:
# show the last rows

In [77]:
# select one column

# select multiple columns

# get the unique values in a column
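A sketch of how these cells could be filled in. To stay runnable offline it uses a small stand-in DataFrame called `users` (a hypothetical name) with made-up values; the column names are assumed to match those of the `u.user` file.

```python
import pandas as pd

# Toy stand-in for the u.user data (values are made up)
users = pd.DataFrame(
    {
        "age": [24, 53, 23, 24, 33],
        "gender": ["M", "F", "M", "M", "F"],
        "occupation": ["technician", "other", "writer", "technician", "other"],
        "zip_code": ["85711", "94043", "32067", "43537", "15213"],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)

first_rows = users.head(2)                   # first 2 rows (default is 5)
last_rows = users.tail(2)                    # last 2 rows
ages = users["age"]                          # one column -> Series
subset = users[["age", "occupation"]]        # list of columns -> DataFrame
occupations = users["occupation"].unique()   # distinct values in a column
print(occupations)
```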

In [76]:
# slicing rows with .iloc (the end position is excluded, as usual in Python; with .loc, both endpoint labels are included)

Follow this [link](https://stackoverflow.com/questions/31593201/how-are-iloc-and-loc-different) for a discussion on .iloc (integer position) vs .loc (labels)
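The difference is easiest to see when the index labels do not start at 0. The sketch below uses a toy `users` DataFrame (hypothetical name, made-up values) indexed by `user_id` starting at 1.

```python
import pandas as pd

# Toy stand-in data; the index labels start at 1, positions start at 0
users = pd.DataFrame(
    {"age": [24, 53, 23, 24, 33]},
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)

by_position = users.iloc[1:3]   # positions 1 and 2; the end position is excluded
by_label = users.loc[1:3]       # labels 1, 2 and 3; both endpoints are included

print(list(by_position.index))  # [2, 3]
print(list(by_label.index))     # [1, 2, 3]
```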

In [None]:
# show the types of the dataframe columns

In [None]:
# show more info about the dataframe

In [None]:
# change datatype of one column
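One way to fill in the three cells above, sketched on a toy `users` DataFrame (hypothetical name, made-up values, column names assumed to match `u.user`).

```python
import pandas as pd

# Toy stand-in data
users = pd.DataFrame(
    {
        "age": [24, 53, 23],
        "occupation": ["technician", "other", "writer"],
    },
    index=pd.Index([1, 2, 3], name="user_id"),
)

print(users.dtypes)   # dtype of each column
users.info()          # index, columns, non-null counts, memory usage

# astype returns a new DataFrame; the original is left unchanged
users_float = users.astype({"age": "float64"})
print(users_float.dtypes)
```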

**Note**: Most operations return a new object and leave the original untouched (unless `inplace=True` is specified). Using `inplace=True` is not advised; it is better to assign the transformed DataFrame to a new variable.

<p></p>

<center>Explicit is better than implicit.</center>

<p></p>
Let's see an example with unexpected consequences:

In [None]:
# drop columns in place in a DataFrame created by plain assignment (df_new = df)
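A sketch of the pitfall, using a toy `users` DataFrame (hypothetical name and values). Plain assignment does not copy anything, so an in-place drop on the second name also changes the first.

```python
import pandas as pd

# Toy stand-in data
users = pd.DataFrame(
    {
        "age": [24, 53, 23],
        "gender": ["M", "F", "M"],
        "zip_code": ["85711", "94043", "32067"],
    }
)

df_alias = users                                     # NOT a copy: same object, two names
df_alias.drop(columns=["zip_code"], inplace=True)
print("zip_code" in users.columns)                   # False: the original changed too

# Safer pattern: take an explicit copy and assign the result to a new variable
users_copy = users.copy()
trimmed = users_copy.drop(columns=["gender"])        # returns a new DataFrame
print("gender" in users_copy.columns)                # True: the copy is untouched
```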


In [50]:
# rename columns 

Rows can also be dropped. Note that the indices do not reset. The index is associated with the row, not with the order.


In [None]:
# drop row
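A short sketch of this behaviour on a toy `users` DataFrame (hypothetical name and values): after dropping the row labelled 2, the remaining labels keep their values, and `reset_index` is needed to renumber them.

```python
import pandas as pd

# Toy stand-in data
users = pd.DataFrame(
    {"age": [24, 53, 23, 24, 33]},
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)

dropped = users.drop(index=2)        # drop the row whose label is 2
print(list(dropped.index))           # [1, 3, 4, 5]

# Renumber from 0 only if explicitly requested
renumbered = dropped.reset_index(drop=True)
print(list(renumbered.index))        # [0, 1, 2, 3]
```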

In [52]:
# set a column as the new index

In [98]:
# filter observations (these are SQL-like operations): SELECT * FROM df WHERE ... AND ... OR ...

    # Keep only writers
    
    # Keep only writers and students
    
    # Keep only zip_codes starting with 21
    
    # Keep only writers and students OR zip_codes starting with 21
    
    # Keep only unique observations
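These filters can be sketched with boolean masks. The toy `users` DataFrame below is a hypothetical stand-in with made-up values (column names assumed to match `u.user`); `zip_code` is kept as a string so that `.str.startswith` applies.

```python
import pandas as pd

# Toy stand-in data; rows 4 and 6 are deliberate duplicates
users = pd.DataFrame(
    {
        "age": [24, 53, 23, 21, 33, 21],
        "gender": ["M", "F", "M", "M", "F", "M"],
        "occupation": ["technician", "writer", "student", "student", "other", "student"],
        "zip_code": ["85711", "21044", "32067", "21218", "15213", "21218"],
    },
    index=pd.Index([1, 2, 3, 4, 5, 6], name="user_id"),
)

is_writer = users["occupation"] == "writer"
is_writer_or_student = users["occupation"].isin(["writer", "student"])
zip_starts_21 = users["zip_code"].str.startswith("21")

writers = users[is_writer]
writers_students = users[is_writer_or_student]
zip_21 = users[zip_starts_21]
either = users[is_writer_or_student | zip_starts_21]   # | is OR, & is AND
unique_rows = users.drop_duplicates()                  # keeps the first of each duplicate
```

Each mask is a boolean Series aligned with the DataFrame index, which is why masks can be combined with `|` and `&` before indexing.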

In [72]:
# set the zip_code to be the index
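A minimal sketch with `set_index`, again on a toy `users` DataFrame (hypothetical name and values). The method returns a new DataFrame, and the chosen column moves out of the columns and into the index.

```python
import pandas as pd

# Toy stand-in data
users = pd.DataFrame(
    {
        "age": [24, 53, 23],
        "zip_code": ["85711", "94043", "32067"],
    },
    index=pd.Index([1, 2, 3], name="user_id"),
)

by_zip = users.set_index("zip_code")   # the old user_id index is discarded
print(by_zip.index.name)               # zip_code
print(list(by_zip.columns))            # ['age']
```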

In [96]:
# Create a new column with the year they were born

In [None]:
# Add two columns

In [97]:
# Apply a custom function (a lambda, for example ;) to a column
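A sketch of these three cells on a toy `users` DataFrame (hypothetical name and values). The reference year is an assumption, since the data does not say when the ages were recorded, and "add two columns" is read here as an element-wise sum.

```python
import pandas as pd

# Toy stand-in data
users = pd.DataFrame(
    {
        "age": [24, 53, 23],
        "occupation": ["technician", "other", "writer"],
    },
    index=pd.Index([1, 2, 3], name="user_id"),
)

reference_year = 1998  # assumption: the year in which the ages were recorded

# New column computed from an existing one
users["birth_year"] = reference_year - users["age"]

# Element-wise sum of two columns
users["age_plus_birth_year"] = users["age"] + users["birth_year"]

# Custom function applied to every value of a column
users["occupation_upper"] = users["occupation"].apply(lambda s: s.upper())
print(users)
```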

## Basic Plots & Statistics

In [81]:
# describe the dataset
    # include=[object], all ...

In [86]:
# get the mean (DataFrame or Series)

In [None]:
# get the standard deviation (DataFrame or Series)

In [None]:
# count the number of appearances of given observations (DataFrame or Series)

# How could I know if there are more F or M
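One way to approach these cells, sketched on a toy `users` DataFrame (hypothetical name, made-up values, column names assumed to match `u.user`).

```python
import pandas as pd

# Toy stand-in data
users = pd.DataFrame(
    {
        "age": [24, 53, 23, 24, 33],
        "gender": ["M", "F", "M", "M", "F"],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)

summary = users.describe()                  # numeric columns only by default
summary_all = users.describe(include="all") # also summarises non-numeric columns

mean_age = users["age"].mean()
std_age = users["age"].std()                # sample standard deviation (ddof=1)

gender_counts = users["gender"].value_counts()   # occurrences of each value
most_common = gender_counts.idxmax()             # label with the highest count
print(gender_counts)
```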

In [None]:
# Get the mean age of each occupation

In [109]:
# Get the mean age and std of each occupation
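Grouped statistics can be sketched with `groupby`. The toy `users` DataFrame below is a hypothetical stand-in with made-up values; the column name `occupation` is assumed to match `u.user`.

```python
import pandas as pd

# Toy stand-in data
users = pd.DataFrame(
    {
        "age": [24, 53, 23, 24, 33],
        "occupation": ["technician", "other", "writer", "technician", "other"],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)

# One statistic per group
mean_by_occupation = users.groupby("occupation")["age"].mean()

# Several statistics per group: one column per statistic
stats_by_occupation = users.groupby("occupation")["age"].agg(["mean", "std"])
print(stats_by_occupation)
```

A group with a single observation has no sample standard deviation, so its `std` comes out as NaN.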

Have a [look at the built-in plot options.](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.plot.html)

In [None]:
# plot a histogram of the column age

In [108]:
# plot a pie chart of the number of F/M 

OK. That was fun. **Please, NEVER** [plot a pie chart again](https://www.geckoboard.com/blog/pie-charts/#:~:text=The%20case%20against%20pie%20charts,reading%20accurate%20values%20is%20difficult.) ;)

## Summary 

**Topics**


**Student feedback**



## Further materials

* [Read the docs!](https://pandas.pydata.org/pandas-docs/stable/index.html)
* [Cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
* [Exercises to practice](https://github.com/guipsamora/pandas_exercises)
