<a href="https://colab.research.google.com/github/Sagi15G/de_python_course/blob/main/pandas_data_exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Before we start...

The most important thing you have to remember is that 

**you don't have to remember anything!**


## When you have a question, just **GOOGLE** it


# Agenda

* Basic Concepts
  * Pandas Series
  * Pandas DataFrame

* Data Extraction 
* Data Exploration
* Basic Data Manipulations


# Basic Concepts

In [None]:
# It is very common to use the "pd" alias when importing pandas
import pandas as pd

"""
Since pandas is actively maintained and its API changes between releases,
let's check the version that we are using so we can be specific when
trying to find a solution on Google
"""
pd.__version__


### Series
A **`Series`** in pandas is a representation of a column.

You can think of a `Series` as a list of values, where each value has an index label.

[documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) 

In [None]:
# note that the dtype is inferred from the data
color_series = pd.Series(data=['Blue', 'Green', 'Red'])
color_series

In [None]:
size_series = pd.Series(data=[4.5, 3, 2])
size_series
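A small sketch of two things the cells above hint at: how the dtype is inferred, and how values are looked up. The `price_series` name, its labels, and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: a Series with explicit index labels and a name
price_series = pd.Series(
    data=[4.5, 3, 2],
    index=["small", "medium", "large"],
    name="price",
)

# Mixing floats and ints makes pandas infer a float dtype
print(price_series.dtype)

# Values can be looked up by label or by position
by_label = price_series["medium"]
by_position = price_series.iloc[0]
print(by_label, by_position)
```

Without an explicit `index`, pandas numbers the rows 0, 1, 2, ... as in the cells above.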

### DataFrame
A **`DataFrame`** in pandas is a representation of a table.

Each `DataFrame` contains one or more `Series` 

[documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) 

In [None]:
# There are many ways to create a DataFrame
# For example, here we use the Series objects from before
my_df = pd.DataFrame( { 'color': color_series, 'size': size_series })
my_df
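To make the "a DataFrame contains Series" idea concrete, here is a minimal sketch that builds a similar table from plain lists and pulls a column back out. `example_df` is an illustrative name, not part of the course files.

```python
import pandas as pd

# Build a small table directly from a dict of lists
example_df = pd.DataFrame({
    "color": ["Blue", "Green", "Red"],
    "size": [4.5, 3, 2],
})

# Selecting a single column gives back a Series
size_column = example_df["size"]
is_series = isinstance(size_column, pd.Series)

# shape is (number of rows, number of columns)
shape = example_df.shape
column_names = list(example_df.columns)
print(is_series, shape, column_names)
```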

In [None]:
# Creating a DataFrame from list of lists

# initialize list of lists
# note: the last row has no age, so pandas leaves that value missing (NaN)
list_of_lists = [['Tom', 10], ['Jerry', 15], ['Maya']]

# Create the pandas DataFrame
pd.DataFrame(list_of_lists, columns = ['Name', 'Age'])
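The last row in the list above has no age. As a rough sketch of how one might detect and handle that gap, the following uses `isna()` and `dropna()`; the names `people_df` and `people_with_age` are chosen here for illustration.

```python
import pandas as pd

rows = [["Tom", 10], ["Jerry", 15], ["Maya"]]
people_df = pd.DataFrame(rows, columns=["Name", "Age"])

# Count the missing values in each column
missing_per_column = people_df.isna().sum()
print(missing_per_column)

# Keep only the rows where Age is present
people_with_age = people_df.dropna(subset=["Age"])
print(people_with_age)
```

Whether to drop, fill, or keep missing values depends on the analysis; dropping is just one option.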

### Data Extraction
One of the powerful features of pandas is how simple it makes reading data from different sources.

[read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)

[read_json](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html)

[read_sql](https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html)
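The readers linked above accept file paths, and `read_csv` and `read_json` also accept file-like objects. The sketch below uses in-memory text instead of real files so it can run anywhere; the sample data is invented for illustration.

```python
import io
import pandas as pd

# Hypothetical inline data standing in for files on disk
csv_text = "name,age\nTom,10\nJerry,15\n"
json_text = '[{"name": "Tom", "age": 10}, {"name": "Jerry", "age": 15}]'

# Wrap the text in StringIO so pandas treats it like an open file
csv_df = pd.read_csv(io.StringIO(csv_text), sep=",")
json_df = pd.read_json(io.StringIO(json_text))

print(csv_df)
print(json_df)
```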

Let's clone a repository with some CSV files from GitHub onto our machine and load them into a DataFrame.



In [None]:
# prefixing a line with "!" in this notebook runs it as a shell command
! git clone https://github.com/Sagi15G/de_python_course.git

In [None]:
# list the files in our working directory
!ls

In [None]:
# this is the path of our course files
!ls de_python_course/data/movies_csv/*

In [None]:
# quickly go over the README file
!cat de_python_course/data/movies_csv/README.txt

In [None]:
# now that we have some files let's load one of them
movies_df = pd.read_csv(filepath_or_buffer='de_python_course/data/movies_csv/movies.csv', sep=',')
# head() --> show the first rows (5 by default)
movies_df.head()

### Data Exploration


In [None]:
# Let's see some pandas functions that will help us to understand our data

# info() --> column metadata (names, non-null counts, dtypes)
movies_df.info()

In [None]:
# describe() --> basic statistics of the data
# note that by default we see only the numeric columns
movies_df.describe()

In [None]:
# using include='all' will show all the columns
movies_df.describe(include='all')
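The difference between the two `describe()` calls is easier to see on a tiny table. This is a sketch on made-up data (`toy_df` is an illustrative name), not on the course files.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "rating": [3.0, 4.0, 5.0],
})

# Default: only numeric columns are summarized
numeric_summary = toy_df.describe()

# include="all": text columns get count/unique/top/freq as well
full_summary = toy_df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the combined summary, statistics that do not apply to a column (for example `mean` for a text column) show up as `NaN`.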

In [None]:
# another useful function to start with is value_counts()
"""
roughly the same as this SQL command
SELECT genres, COUNT(*)
FROM movies_df
GROUP BY genres
ORDER BY COUNT(*) DESC
"""

movies_df["genres"].value_counts()
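To connect `value_counts()` to the SQL analogy, here is a sketch on invented data comparing it with an explicit `groupby`. The `genre_df` frame is hypothetical.

```python
import pandas as pd

genre_df = pd.DataFrame(
    {"genres": ["Drama", "Comedy", "Drama", "Drama", "Comedy", "Horror"]}
)

# value_counts(): one row per distinct value, most frequent first
counts = genre_df["genres"].value_counts()

# groupby(...).size(): same counts, but ordered by the group key
group_sizes = genre_df.groupby("genres").size()

# normalize=True returns shares instead of raw counts
shares = genre_df["genres"].value_counts(normalize=True)

print(counts)
print(group_sizes)
print(shares)
```

One difference from SQL's `COUNT(*)`: by default `value_counts()` leaves out missing values.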

### Basic Data Manipulations

In [None]:
# let's first load another DataFrame
ratings_df = pd.read_csv(filepath_or_buffer='de_python_course/data/movies_csv/ratings.csv', sep=',')
ratings_df.head(10)


In [None]:
# a quick exploration
ratings_df.describe()

In [None]:
# accessing columns is pretty straightforward
# for example, let's keep only the rows with rating >= 4
ratings_df[ratings_df["rating"] >= 4].head()
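The filter above works in two steps: the comparison produces a Series of True/False values (a boolean mask), and indexing with that mask keeps only the True rows. A minimal sketch on made-up data (`sample_ratings` is an illustrative name):

```python
import pandas as pd

sample_ratings = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "rating": [5.0, 3.5, 4.0, 2.0],
})

# Step 1: the comparison yields one boolean per row
mask = sample_ratings["rating"] >= 4

# Step 2: indexing with the mask keeps the rows where it is True
high_rated = sample_ratings[mask]

# Conditions are combined with & (and) or | (or); each needs parentheses
mid_rated = sample_ratings[
    (sample_ratings["rating"] >= 3) & (sample_ratings["rating"] < 5)
]

print(high_rated)
print(mid_rated)
```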

In [None]:
# now let's add a new column that shows the timestamp (seconds since 1970-01-01) as a datetime
ratings_df['timestamp_as_datetime'] = pd.to_datetime(ratings_df['timestamp'], unit='s')
ratings_df.head()
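With `unit='s'`, `pd.to_datetime` interprets each number as seconds since 1970-01-01 (the Unix epoch). A quick sketch with hand-picked values, chosen here only to show the conversion:

```python
import pandas as pd

# Hypothetical epoch timestamps in seconds
stamps = pd.Series([0, 86400, 1_000_000_000])

as_datetime = pd.to_datetime(stamps, unit="s")

# The .dt accessor exposes datetime parts such as the year
years = as_datetime.dt.year

print(as_datetime)
print(years)
```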



In [None]:
# let's calculate the average rating per year
ratings_df.groupby(ratings_df["timestamp_as_datetime"].dt.year, sort=True)["rating"].mean()
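The same grouping pattern can compute several statistics at once with `agg`. The sketch below uses a small invented table (`toy_ratings` and the `when` column are illustrative names) so the result is easy to verify by hand.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "rating": [4.0, 2.0, 5.0, 3.0],
    "when": pd.to_datetime(
        ["2000-01-15", "2000-06-01", "2001-03-10", "2001-12-31"]
    ),
})

# Group by the year of each timestamp, then compute mean and count per year
yearly = toy_ratings.groupby(toy_ratings["when"].dt.year)["rating"].agg(
    ["mean", "count"]
)
print(yearly)
```

Looking at the count next to the mean helps spot years where the average rests on very few ratings.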

In [None]:
# A spoiler for the next sessions: why not plot it?
ratings_df.groupby(ratings_df["timestamp_as_datetime"].dt.year, sort=True)["rating"].mean().plot()

### Summary
There are multiple ways to accomplish most tasks in pandas.

Don't try to memorize them; just Google what you need and save precious time (: