<a href="https://colab.research.google.com/github/TheMaze45/Pandas/blob/main/Recap_Data_Exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recap on Pandas DataFrames

DataFrames in Pandas are mutable two-dimensional structures of data with labeled axes where:



*   each row represents a different observation
*   each column represents a different variable



Before we can create a DataFrame, we have to import the Pandas module into our coding environment.

In [2]:
# We can do that with
import pandas as pd  # you can choose any alias you want, but pd is the standard convention!
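
To make the row/column idea concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The name `orders` is hypothetical, and the values are just a few entries borrowed from the order table used later in this notebook.

```python
import pandas as pd

# Hypothetical sample: each dictionary key becomes a column (variable),
# each position in the lists becomes a row (observation).
orders = pd.DataFrame(
    {
        "sku": ["OTT0133", "LGE0043", "PAR0071"],
        "product_quantity": [1, 1, 1],
        "unit_price": [18.99, 399.00, 474.05],
    }
)

print(orders)
print(orders.columns.tolist())  # the column labels
print(orders.index.tolist())    # the row labels
```

Both axes are labeled: the columns by the dictionary keys, the rows by a default integer index.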

# 1. Import a CSV file (comma-separated values) into a DataFrame


When working with data, most of the time you don't create it yourself. Your main job as a data analyst or data scientist is to get data and work with it.

We call this process importing (or "reading") data, for example from a CSV file or from a database.

With Pandas it is easy to import data and convert it into a DataFrame.

Here we take, for example, a CSV file that we worked with before.

## Important:

When importing a file from your Google Drive, make sure that the access permission for the file is set to "anyone with the link can read/view the file", otherwise you will have trouble importing the data.
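
The import cell in this section rewrites the share link into a direct-download link before reading it. As a minimal sketch, the same string manipulation can be wrapped in a small helper. The function name `drive_share_to_download_url` is hypothetical, and the sketch assumes the share link has the form `.../file/d/<FILE_ID>/view?usp=sharing`.

```python
def drive_share_to_download_url(share_url: str) -> str:
    """Turn a Google Drive share link into a direct-download link (hypothetical helper)."""
    # Assumption: the link looks like .../file/d/<FILE_ID>/view?usp=sharing,
    # so the file ID is the second-to-last part when splitting on "/".
    file_id = share_url.split("/")[-2]
    return "https://drive.google.com/uc?export=download&id=" + file_id


share_url = "https://drive.google.com/file/d/1FYhN_2AzTBFuWcfHaRuKcuCE6CWXsWtG/view?usp=sharing"
download_url = drive_share_to_download_url(share_url)
print(download_url)
```

Links in a different format would need different handling, so treat this as an illustration rather than a general solution.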

In [3]:
# url is the share link to your file
url = "https://drive.google.com/file/d/1FYhN_2AzTBFuWcfHaRuKcuCE6CWXsWtG/view?usp=sharing"  # orderlines.csv
# Build a direct-download link from the file ID (the second-to-last part of the share link)
path = "https://drive.google.com/uc?export=download&id=" + url.split("/")[-2]
# This creates the DataFrame
df = pd.read_csv(path)

# Take a quick glance at the DataFrame to see if the import worked.
# Don't forget to run "import pandas as pd" first!
df.head()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38


As we can see, the import worked: the first five rows contain plausible values in every column.
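
Looking at a few rows is only a spot check. As a hedged sketch, the snippet below counts missing values per column on a small sample. The sample is the first three rows shown above, written out as CSV text so that it runs without network access; the names `csv_text` and `sample` are assumptions made for this example.

```python
import io

import pandas as pd

# The first rows of the table above as CSV text (no download needed).
csv_text = """id,id_order,product_id,product_quantity,sku,unit_price,date
1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Count missing values per column; all zeros means no gaps in this sample.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

On the full DataFrame the same call would be `df.isna().sum()`.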

# 2. DataFrame dimensions

If you don't want to (or can't) count the rows and columns of your DataFrame by hand, don't worry, there is a nice little helper:



```
df.shape
```
With this attribute we can get a quick overview of the size of our table.


In [4]:
df.shape

(293983, 7)

It returns a [tuple](https://www.w3schools.com/python/python_tuples.asp) (a data type that stores multiple items in a single variable).

The first element of the tuple is the number of rows in our DataFrame (293983); the second element is the number of columns, which is 7 in our example.

If we want to present this kind of information in a nice way, we could extract the values and put the information into a print statement.

In [5]:
nrows = df.shape[0]
ncols = df.shape[1]
print(f'The number of rows in our dataframe is {nrows} and we have a total number of {ncols} columns.')

The number of rows in our dataframe is 293983 and we have a total number of 7 columns.
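
Because the shape is a tuple, both values can also be unpacked in a single step. The sketch below uses a small, hypothetical frame called `small` so that it runs on its own.

```python
import pandas as pd

# Hypothetical small frame, used only to keep the example self-contained
small = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Tuple unpacking: first element is the row count, second the column count
nrows, ncols = small.shape
message = f"The DataFrame has {nrows} rows and {ncols} columns."
print(message)
```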


The [DataFrame.size](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.size.html) attribute returns the total number of values in a DataFrame.

In our case that is the number of rows (293983) multiplied by the number of columns (7), for a total of 2057881 values.

In [6]:
df.size

2057881

We can verify this with a boolean expression.


```
df.shape[0] * df.shape[1] == df.size
```
If the numbers match, it should return True.


In [7]:
df.shape[0] * df.shape[1] == df.size

True

With the [DataFrame.ndim](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ndim.html) attribute we can get the number of dimensions.
A DataFrame always has two dimensions, because it consists of rows and columns.

A Series, on the other hand, has only one dimension.

In [18]:
df.ndim

2

In [16]:
# Create example series to show the difference
test_series = pd.Series({"Country":"Germany","Capital":"Berlin","Residents":83200000})
test_series

Country       Germany
Capital        Berlin
Residents    83200000
dtype: object

In [17]:
test_series.ndim

1
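
The same difference shows up when you select columns from a DataFrame. As a small sketch with a hypothetical two-row frame: single brackets return a one-dimensional Series, while double brackets return a DataFrame that still has two dimensions.

```python
import pandas as pd

# Hypothetical frame with two rows, only for illustration
frame = pd.DataFrame({"sku": ["OTT0133", "LGE0043"], "product_quantity": [1, 1]})

one_column = frame["sku"]       # single brackets: a Series
still_a_frame = frame[["sku"]]  # double brackets: a one-column DataFrame

print(type(one_column).__name__, one_column.ndim, one_column.shape)
print(type(still_a_frame).__name__, still_a_frame.ndim, still_a_frame.shape)
```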

# 3. DataFrame exploration

As we saw in the import step, [DataFrame.head()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) is a quick way to get a first glance at a DataFrame. Its counterpart [DataFrame.tail()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html) does the same for the end of the table.

By default, they show the first 5 and the last 5 rows, respectively.

You can, of course, change that by passing the number of rows as an argument.


In [19]:
# Without arguments
df.head()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38


In [20]:
# Without arguments
df.tail()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
293978,1650199,527398,0,1,JBL0122,42.99,2018-03-14 13:57:25
293979,1650200,527399,0,1,PAC0653,141.58,2018-03-14 13:57:34
293980,1650201,527400,0,2,APP0698,9.99,2018-03-14 13:57:41
293981,1650202,527388,0,1,BEZ0204,19.99,2018-03-14 13:58:01
293982,1650203,527401,0,1,APP0927,13.99,2018-03-14 13:58:36


In [21]:
# Display the first 10 entries
df.head(10)

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.00,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38
5,1119114,295310,0,10,WDT0249,231.79,2017-01-01 01:14:27
6,1119115,299544,0,1,APP1582,1.137.99,2017-01-01 01:17:21
7,1119116,299545,0,1,OWC0100,47.49,2017-01-01 01:46:16
8,1119119,299546,0,1,IOT0014,18.99,2017-01-01 01:50:34
9,1119120,295347,0,1,APP0700,72.19,2017-01-01 01:54:11


In [22]:
# Display the last 2 entries
df.tail(2)

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
293981,1650202,527388,0,1,BEZ0204,19.99,2018-03-14 13:58:01
293982,1650203,527401,0,1,APP0927,13.99,2018-03-14 13:58:36
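
A compact way to check how the argument behaves is to try it on a tiny frame. The sketch below uses a hypothetical ten-row frame named `numbers` and looks at the row labels that come back.

```python
import pandas as pd

# Hypothetical frame with ten rows, labeled 0 to 9
numbers = pd.DataFrame({"value": range(10)})

first_three = numbers.head(3)
last_two = numbers.tail(2)
everything = numbers.head(100)  # asking for more rows than exist returns the whole frame

print(first_three.index.tolist())
print(last_two.index.tolist())
print(len(everything))
```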


### General info about our DataFrame

With the methods

*   [DataFrame.info()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html)
*   [DataFrame.describe()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)
*   [DataFrame.nunique()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nunique.html)

we can get a general overview of what is inside our DataFrame.





```
df.info() 
```
This method tells us:

*   How the data is stored (the data type of each column)
*   Whether there are any missing values (the non-null count of each column)
*   How many rows and columns exist in our DataFrame


In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293983 entries, 0 to 293982
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   id                293983 non-null  int64 
 1   id_order          293983 non-null  int64 
 2   product_id        293983 non-null  int64 
 3   product_quantity  293983 non-null  int64 
 4   sku               293983 non-null  object
 5   unit_price        293983 non-null  object
 6   date              293983 non-null  object
dtypes: int64(4), object(3)
memory usage: 15.7+ MB
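
In the output above, `unit_price` is not stored as a number. One possible reason, assuming it is caused by entries like the `1.137.99` seen in the first ten rows, is that a single value that cannot be parsed as a number makes Pandas read the whole column as text. The sketch below reproduces this on a three-row sample; `csv_text` and `prices` are names made up for this example.

```python
import io

import pandas as pd

# Three prices taken from the table above; the last one has two decimal points.
csv_text = """sku,unit_price
OTT0133,18.99
LGE0043,399.00
APP1582,1.137.99
"""
prices = pd.read_csv(io.StringIO(csv_text))

# The malformed entry prevents the column from being read as numbers.
is_numeric = pd.api.types.is_numeric_dtype(prices["unit_price"])
print(is_numeric)

# errors="coerce" turns values that cannot be parsed into NaN
as_numbers = pd.to_numeric(prices["unit_price"], errors="coerce")
print(as_numbers)
```

Whether coercing such values to NaN is the right cleaning step depends on the data, so this is only meant to show where the text data type can come from.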




```
df.describe()
```
This method gives us an overview of the [descriptive statistics](https://www.scribbr.com/statistics/descriptive-statistics/) for the ***numerical columns of our DataFrame***.

In [24]:
df.describe()

Unnamed: 0,id,id_order,product_id,product_quantity
count,293983.0,293983.0,293983.0,293983.0
mean,1397918.0,419999.116544,0.0,1.121126
std,153009.6,66344.486479,0.0,3.396569
min,1119109.0,241319.0,0.0,1.0
25%,1262542.0,362258.5,0.0,1.0
50%,1406940.0,425956.0,0.0,1.0
75%,1531322.0,478657.0,0.0,1.0
max,1650203.0,527401.0,0.0,999.0
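
The following sketch shows, on a small hypothetical frame, that `describe()` only summarises the numeric columns by default, and how `nunique()` from the list above counts the distinct values in each column. The values in `sample` are invented for illustration.

```python
import pandas as pd

# Hypothetical sample with two numeric columns and one text column
sample = pd.DataFrame(
    {
        "id_order": [299539, 299540, 299540],
        "product_quantity": [1, 1, 2],
        "sku": ["OTT0133", "LGE0043", "OTT0133"],
    }
)

# describe() leaves out the text column "sku" by default
summary = sample.describe()
print(summary.columns.tolist())

# nunique() counts the distinct values in every column
distinct = sample.nunique()
print(distinct)
```

This is also why `unit_price`, `sku` and `date` are missing from the `df.describe()` output above: according to `df.info()`, they are stored as `object`, not as numbers.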
