# PANDAS

__Pandas__ is a powerful _open-source_ data manipulation and analysis library for _Python_. 

It provides data structures and functions for efficiently __handling and analyzing structured data__, such as tables or spreadsheets.

With __pandas__, you can _load_, _clean_, _preprocess_, and _analyze_ data, and create basic _visualizations_ through its Matplotlib-backed plotting interface.

It is widely used in _data science_, _machine learning_, and _data analysis_ projects.

To import the pandas library under its conventional alias `pd`, use `import pandas as pd`.

## The Series Data Structure

A __pandas Series__ is a _one-dimensional labeled array_ capable of holding any data type. It is similar to a _column_ in a spreadsheet or a SQL table, and it behaves in many ways like a _dictionary_ that maps labels to values. It is a fundamental _data structure_ in the __pandas__ library.

A __pandas Series__ consists of two main components: the _data_ and the _index_. The _data_ can be of any type, such as integers, floats, strings, or even complex objects. The _index_ is a sequence of labels, one per element, used to look up and align values. Labels are usually unique, but pandas does not require them to be.

Some key features of pandas Series include:
- Vectorized operations: Series supports vectorized operations, allowing you to perform element-wise computations efficiently.
- Label-based indexing: You can access elements in a Series by label (with `[]` or `.loc[]`) in addition to integer position (with `.iloc[]`).
- Alignment: Series automatically aligns data based on the index, making it easy to perform operations on multiple Series with different indexes.
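
The three features listed above can be sketched with two small Series. The names `prices` and `discounts` and their values are made up for illustration.

```python
import pandas as pd

# Two hypothetical Series with partially overlapping labels
prices = pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'c'])
discounts = pd.Series([1.0, 2.0, 3.0], index=['b', 'c', 'd'])

# Vectorized operation: every element is doubled without an explicit loop
doubled = prices * 2

# Label-based indexing: access an element by its index label
price_b = prices['b']

# Alignment: values are matched by label; labels present in only
# one of the two Series produce NaN in the result
net = prices - discounts

print(doubled)
print(price_b)
print(net)
```

Because `'a'` and `'d'` each appear in only one Series, the result has `NaN` at those labels.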

To create a __Series__, you can pass a list, array, or dictionary-like object to the `pd.Series()` constructor. You can also specify custom index labels if needed.

In [8]:
import pandas as pd
# Create a Series object from a list that mixes strings and integers (dtype becomes object)
list_elements = ['a','b','c','d','e',2,1,3,4,5]
serie_1 = pd.Series(list_elements)
print('Serie 1:', type(serie_1),'\n', serie_1)

# Create a Series object from a list of numbers
list_numbers = [1,2,3,4,5]
serie_2 = pd.Series(list_numbers)
print('Serie 2:', type(serie_2), '\n', serie_2)

# Create a Series object from a list of numbers with a None value
list_numbers_with_none = [1,2,None,4,5]
serie_3 = pd.Series(list_numbers_with_none)
print('Serie 3:', type(serie_3), '\n', serie_3)

Serie 1: <class 'pandas.core.series.Series'> 
 0    a
1    b
2    c
3    d
4    e
5    2
6    1
7    3
8    4
9    5
dtype: object
Serie 2: <class 'pandas.core.series.Series'> 
 0    1
1    2
2    3
3    4
4    5
dtype: int64
Serie 3: <class 'pandas.core.series.Series'> 
 0    1.0
1    2.0
2    NaN
3    4.0
4    5.0
dtype: float64


In [6]:
# Create a Series object from a dictionary
dict_data = {'a':1, 'b':2, 'c':3, 'd':4}
serie_4 = pd.Series(dict_data)
print('Serie 4:', type(serie_4), '\n', serie_4)

# Get the values of the Series index
print('Serie 4 index:', serie_4.index)

Serie 4: <class 'pandas.core.series.Series'> 
 a    1
b    2
c    3
d    4
dtype: int64
Serie 4 index: Index(['a', 'b', 'c', 'd'], dtype='object')


In [7]:
# Create a series object from a list of tuple pairs
list_tuples = [('est-1', 'Ana'), ('est-2','Bob'),('est-3','Hermenejildo')]

serie_5 = pd.Series(list_tuples)
print('Serie 5:', type(serie_5), '\n', serie_5)

Serie 5: <class 'pandas.core.series.Series'> 
 0             (est-1, Ana)
1             (est-2, Bob)
2    (est-3, Hermenejildo)
dtype: object


In [4]:
# Create a series object from a list as values and a list as index
list_index = ['a', 'b', 'c', 'd']
list_values = ['Ana', 'Bob', 'Claire', 'Hermenejildo']
serie_6 = pd.Series(list_values, index=list_index)
print('Serie 6:', type(serie_6), '\n', serie_6)

Serie 6: <class 'pandas.core.series.Series'> 
 a             Ana
b             Bob
c          Claire
d    Hermenejildo
dtype: object


In [15]:
# Query a Series object by boolean indexing
print(serie_4>1)
print('-'*30)
print('Serie 4 > 2:\n', serie_4[serie_4>2])

a    False
b     True
c     True
d     True
dtype: bool
------------------------------
Serie 4 > 2:
 c    3
d    4
dtype: int64


In [16]:
# Query a Series object by fancy indexing
index_list = ['a','c']
print('serie 6(["a", "c"]):\n', serie_6[index_list])

serie 6(["a", "c"]):
 a       Ana
c    Claire
dtype: object


In [17]:
# Query a Series object using loc[]
print('Serie_6.loc[["a","d"]]:\n', serie_6.loc[['a','d']])

Serie_6.loc[["a","d"]]:
 a             Ana
d    Hermenejildo
dtype: object


In [20]:
# Query a Series object using iloc[]
print('Serie_6.iloc[:2]:\n', serie_6.iloc[:2])
print('-'*30)
print('Serie_6.iloc[-2:]:\n', serie_6.iloc[-2:])

Serie_6.iloc[:2]:
 a    Ana
b    Bob
dtype: object
------------------------------
Serie_6.iloc[-2:]:
 c          Claire
d    Hermenejildo
dtype: object


## The DataFrame Data Structure

A __pandas DataFrame__ is a _two-dimensional_, _labeled_ data structure in _Python_ that is commonly used for _data manipulation and analysis_. It consists of _rows_ and _columns_, similar to a table in a relational database.

The __DataFrame__ can store _heterogeneous data types_ and provides various operations and functions to perform data manipulation, filtering, grouping, and statistical analysis.
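
As a minimal sketch of heterogeneous columns, the following builds a DataFrame from a dictionary of lists. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical records: each column holds a different kind of data
people_df = pd.DataFrame({
    'name': ['Ana', 'Bob', 'Claire'],
    'age': [31, 45, 28],
    'height_m': [1.62, 1.80, 1.70],
    'is_student': [False, False, True],
})

# Each column has its own dtype (text, integer, float, boolean)
print(people_df.dtypes)
print(people_df.shape)
```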

To access and manipulate the data in the __DataFrame__, you can use the _methods_ and _attributes_ it exposes, such as `shape`, `columns`, `index`, `head()`, `describe()`, and `info()`, all of which appear in the cells below.

For more information on __pandas DataFrame__, refer to the [official pandas documentation](https://pandas.pydata.org/docs/reference/frame.html).

In [29]:
# create dataframes from lists

list_example = [[1,2,3], [4,5,6],[7,8,9],[10,11,12]]
df = pd.DataFrame(list_example)
print('DataFrame:\n',df)
print('DataFrame.shape:',df.shape)
print('DataFrame.columns:', df.columns)
print('DataFrame index:', df.index)
print('-'*50)
df.head()

DataFrame:
     0   1   2
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12
DataFrame.shape: (4, 3)
DataFrame.columns: RangeIndex(start=0, stop=3, step=1)
DataFrame index: RangeIndex(start=0, stop=4, step=1)
--------------------------------------------------


| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 |
| 3 | 10 | 11 | 12 |


In [31]:
# create a dataframe from a list of dictionaries
list_dictionaries = [{'a':1, 'b':2, 'c':3},{'a':4, 'b':5, 'c':6},{'a':14, 'b':24, 'c':36}]
df = pd.DataFrame(list_dictionaries)
print('Columns of the Dataframe:', df.columns)
print('Index of the dataframe', df.index)
print('-'*50)
df.head()

Columns of the Dataframe: Index(['a', 'b', 'c'], dtype='object')
Index of the dataframe RangeIndex(start=0, stop=3, step=1)
--------------------------------------------------


| | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
| 2 | 14 | 24 | 36 |


In [32]:
# create a dataframe from a csv file
csv_df = pd.read_csv('csv-files/StudentsInfo.csv')
csv_df.head()

| | Name | Company | Position | Salary |
|---|---|---|---|---|
| 0 | Alice Johnson | Hernandez, Griffith and Nelson | Petroleum engineer | 4740 |
| 1 | David Jones | Gomez-Garcia | Geologist, engineering | 73329 |
| 2 | Eva Brown | Blevins LLC | Microbiologist | 83245 |
| 3 | Frank Davis | Greene-Wilson | Museum education officer | 74390 |
| 4 | Jack Anderson | Butler PLC | Scientist, research (maths) | 69851 |


In [44]:
# create a dataframe from a json file
json_df = pd.read_json('json-files/StudentsInfo.json')
print('Shape:', json_df.shape)
json_df.head()

Shape: (50, 4)


| | id | name | career | college |
|---|---|---|---|---|
| 0 | 1 | Alice Johnson | Computer Science | Tech University |
| 1 | 2 | Bob Smith | Mechanical Engineering | Engineering Institute |
| 2 | 3 | Carol Williams | Electrical Engineering | Tech University |
| 3 | 4 | David Jones | Biology | Science College |
| 4 | 5 | Eva Brown | Physics | Tech University |


In [47]:
# describe a dataframe
print('JSON DATAFRAME DESCRIBE:\n')
print(json_df.describe())
print('-'*50)
print('CSV DATAFRAME DESCRIBE:\n')
csv_df.describe()

JSON DATAFRAME DESCRIBE:

             id
count  50.00000
mean   25.50000
std    14.57738
min     1.00000
25%    13.25000
50%    25.50000
75%    37.75000
max    50.00000
--------------------------------------------------
CSV DATAFRAME DESCRIBE:



| | Salary |
|---|---|
| count | 30.0 |
| mean | 53491.6 |
| std | 27432.862783 |
| min | 4740.0 |
| 25% | 27472.0 |
| 50% | 64104.0 |
| 75% | 72981.25 |
| max | 93995.0 |


In [48]:
# get information about a dataframe
print('JSON DATAFRAME INFO:\n')
print(json_df.info())
print('-'*50)
print('CSV DATAFRAME INFO:\n')
csv_df.info()

JSON DATAFRAME INFO:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       50 non-null     int64 
 1   name     50 non-null     object
 2   career   50 non-null     object
 3   college  50 non-null     object
dtypes: int64(1), object(3)
memory usage: 1.7+ KB
None
--------------------------------------------------
CSV DATAFRAME INFO:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Name      30 non-null     object
 1   Company   30 non-null     object
 2   Position  30 non-null     object
 3   Salary    30 non-null     int64 
dtypes: int64(1), object(3)
memory usage: 1.1+ KB


In [52]:
# indexes and columns
json_changed_name = json_df.set_index('name')
print(json_changed_name.head())

# Change column name
csv_df = csv_df.rename(columns={'Salary':'Salary (UD/year)', 'Position':'Current Position'})
csv_df.head()

                id                  career                college
name                                                             
Alice Johnson    1        Computer Science        Tech University
Bob Smith        2  Mechanical Engineering  Engineering Institute
Carol Williams   3  Electrical Engineering        Tech University
David Jones      4                 Biology        Science College
Eva Brown        5                 Physics        Tech University


| | Name | Company | Current Position | Salary (UD/year) |
|---|---|---|---|---|
| 0 | Alice Johnson | Hernandez, Griffith and Nelson | Petroleum engineer | 4740 |
| 1 | David Jones | Gomez-Garcia | Geologist, engineering | 73329 |
| 2 | Eva Brown | Blevins LLC | Microbiologist | 83245 |
| 3 | Frank Davis | Greene-Wilson | Museum education officer | 74390 |
| 4 | Jack Anderson | Butler PLC | Scientist, research (maths) | 69851 |


### Using Datetime in Pandas

In [57]:
# converting a column to datetime with to_datetime()
list_dates = ['2023-02-05', '2024-05-23']
df = pd.DataFrame(list_dates, columns=['date_example'])
print(df.info())
df['date_example'] = pd.to_datetime(df['date_example'])
print('-'*80)
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date_example  2 non-null      object
dtypes: object(1)
memory usage: 148.0+ bytes
None
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date_example  2 non-null      datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 148.0 bytes
None


| | date_example |
|---|---|
| 0 | 2023-02-05 |
| 1 | 2024-05-23 |


In [68]:
sales_df = pd.read_json('json-files/sales_data.json')
sales_df = sales_df.set_index('code')
print(sales_df.info())
sales_df['date'] = sales_df['date'].replace('2008-Dic-23','2008-Dec-23')
sales_df['full_date'] = sales_df['date'] + ' ' + sales_df['hour']
sales_df['full_date'] = pd.to_datetime(sales_df['full_date'], format='mixed')
print(sales_df.info())

sales_df.head()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Sale-1117-HdZH to Sale-2882-umoM
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   client       1000 non-null   object 
 1   total_price  1000 non-null   float64
 2   date         1000 non-null   object 
 3   hour         1000 non-null   object 
 4   credit_card  1000 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 46.9+ KB
None
<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Sale-1117-HdZH to Sale-2882-umoM
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   client       1000 non-null   object        
 1   total_price  1000 non-null   float64       
 2   date         1000 non-null   object        
 3   hour         1000 non-null   object        
 4   credit_card  1000 non-null   int64         
 5   full_date    1000 non-null   datetime64[ns

| code | client | total_price | date | hour | credit_card | full_date |
|---|---|---|---|---|---|---|
| Sale-1117-HdZH | Gary Meza | 832.48 | 2024-06-03 | 08:13:18 | 6011811575065598 | 2024-06-03 08:13:18 |
| Sale-5078-hqkc | Carol Martin | 156.29 | 2014-08-22 | 02:57:20 | 6536303182814044 | 2014-08-22 02:57:20 |
| Sale-8209-xGVn | Jeremy Spencer | 832.05 | 1979-05-22 | 02:10:06 | 213185615148626 | 1979-05-22 02:10:06 |
| Sale-9093-bfcp | Pamela Anderson | 166.57 | 1973-12-24 | 21:28:34 | 3558512811558836 | 1973-12-24 21:28:34 |
| Sale-8141-KGOb | Kenneth Marsh | 498.43 | 1974-03-14 | 03:55:36 | 2239583806605394 | 1974-03-14 03:55:36 |


In [70]:
# converting a column from datetime to string with strftime()
sales_df['full_date_formatted'] = sales_df['full_date'].dt.strftime("%Y-%b-%d %H:%M:%S")
sales_df.head()

| code | client | total_price | date | hour | credit_card | full_date | full_date_formatted |
|---|---|---|---|---|---|---|---|
| Sale-1117-HdZH | Gary Meza | 832.48 | 2024-06-03 | 08:13:18 | 6011811575065598 | 2024-06-03 08:13:18 | 2024-Jun-03 08:13:18 |
| Sale-5078-hqkc | Carol Martin | 156.29 | 2014-08-22 | 02:57:20 | 6536303182814044 | 2014-08-22 02:57:20 | 2014-Aug-22 02:57:20 |
| Sale-8209-xGVn | Jeremy Spencer | 832.05 | 1979-05-22 | 02:10:06 | 213185615148626 | 1979-05-22 02:10:06 | 1979-May-22 02:10:06 |
| Sale-9093-bfcp | Pamela Anderson | 166.57 | 1973-12-24 | 21:28:34 | 3558512811558836 | 1973-12-24 21:28:34 | 1973-Dec-24 21:28:34 |
| Sale-8141-KGOb | Kenneth Marsh | 498.43 | 1974-03-14 | 03:55:36 | 2239583806605394 | 1974-03-14 03:55:36 | 1974-Mar-14 03:55:36 |


In [74]:
# converting a column from datetime to a timestamp with timestamp()
sales_df['timestamp'] = sales_df['full_date'].apply(lambda x: x.timestamp())
sales_df['timestamp'] = sales_df['timestamp'].astype('int64')
sales_df.head()
#print(sales_df.info())

| code | client | total_price | date | hour | credit_card | full_date | full_date_formatted | timestamp |
|---|---|---|---|---|---|---|---|---|
| Sale-1117-HdZH | Gary Meza | 832.48 | 2024-06-03 | 08:13:18 | 6011811575065598 | 2024-06-03 08:13:18 | 2024-Jun-03 08:13:18 | 1717402398 |
| Sale-5078-hqkc | Carol Martin | 156.29 | 2014-08-22 | 02:57:20 | 6536303182814044 | 2014-08-22 02:57:20 | 2014-Aug-22 02:57:20 | 1408676240 |
| Sale-8209-xGVn | Jeremy Spencer | 832.05 | 1979-05-22 | 02:10:06 | 213185615148626 | 1979-05-22 02:10:06 | 1979-May-22 02:10:06 | 296187006 |
| Sale-9093-bfcp | Pamela Anderson | 166.57 | 1973-12-24 | 21:28:34 | 3558512811558836 | 1973-12-24 21:28:34 | 1973-Dec-24 21:28:34 | 125616514 |
| Sale-8141-KGOb | Kenneth Marsh | 498.43 | 1974-03-14 | 03:55:36 | 2239583806605394 | 1974-03-14 03:55:36 | 1974-Mar-14 03:55:36 | 132465336 |


In [75]:
sales_df.describe()

| | total_price | credit_card | full_date | timestamp |
|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000 | 1000.0 |
| mean | 489.7138 | 3.550223e+17 | 1996-04-08 18:37:02.880999936 | 828988600.0 |
| min | 11.85 | 60443920000.0 | 1970-01-09 21:20:40 | 768040.0 |
| 25% | 226.6775 | 180011200000000.0 | 1982-09-14 20:13:30.750000 | 400882400.0 |
| 50% | 486.43 | 3520377000000000.0 | 1995-11-17 17:16:20.500000 | 816628600.0 |
| 75% | 732.7525 | 4632459000000000.0 | 2009-05-24 22:14:57.249999872 | 1243203000.0 |
| max | 999.71 | 4.997675e+18 | 2024-06-03 08:13:18 | 1717402000.0 |
| std | 288.791957 | 1.215563e+18 | | 495186600.0 |


In [76]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Sale-1117-HdZH to Sale-2882-umoM
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   client               1000 non-null   object        
 1   total_price          1000 non-null   float64       
 2   date                 1000 non-null   object        
 3   hour                 1000 non-null   object        
 4   credit_card          1000 non-null   int64         
 5   full_date            1000 non-null   datetime64[ns]
 6   full_date_formatted  1000 non-null   object        
 7   timestamp            1000 non-null   int64         
dtypes: datetime64[ns](1), float64(1), int64(2), object(4)
memory usage: 102.6+ KB


### Queries and Transformations

In [77]:
#drop a column
sales_df = sales_df.drop(columns=['full_date', 'timestamp'])  # columns= already selects the axis, so axis=1 is not needed
sales_df.head()

| code | client | total_price | date | hour | credit_card | full_date_formatted |
|---|---|---|---|---|---|---|
| Sale-1117-HdZH | Gary Meza | 832.48 | 2024-06-03 | 08:13:18 | 6011811575065598 | 2024-Jun-03 08:13:18 |
| Sale-5078-hqkc | Carol Martin | 156.29 | 2014-08-22 | 02:57:20 | 6536303182814044 | 2014-Aug-22 02:57:20 |
| Sale-8209-xGVn | Jeremy Spencer | 832.05 | 1979-05-22 | 02:10:06 | 213185615148626 | 1979-May-22 02:10:06 |
| Sale-9093-bfcp | Pamela Anderson | 166.57 | 1973-12-24 | 21:28:34 | 3558512811558836 | 1973-Dec-24 21:28:34 |
| Sale-8141-KGOb | Kenneth Marsh | 498.43 | 1974-03-14 | 03:55:36 | 2239583806605394 | 1974-Mar-14 03:55:36 |


In [78]:
#drop a row
sales_df = sales_df.drop(index='Sale-8141-KGOb')
sales_df.head()

| code | client | total_price | date | hour | credit_card | full_date_formatted |
|---|---|---|---|---|---|---|
| Sale-1117-HdZH | Gary Meza | 832.48 | 2024-06-03 | 08:13:18 | 6011811575065598 | 2024-Jun-03 08:13:18 |
| Sale-5078-hqkc | Carol Martin | 156.29 | 2014-08-22 | 02:57:20 | 6536303182814044 | 2014-Aug-22 02:57:20 |
| Sale-8209-xGVn | Jeremy Spencer | 832.05 | 1979-05-22 | 02:10:06 | 213185615148626 | 1979-May-22 02:10:06 |
| Sale-9093-bfcp | Pamela Anderson | 166.57 | 1973-12-24 | 21:28:34 | 3558512811558836 | 1973-Dec-24 21:28:34 |
| Sale-7567-cCLb | William Smith | 857.13 | 2016-07-04 | 18:26:22 | 3541411469135312 | 2016-Jul-04 18:26:22 |


In [80]:
# query a dataframe by column
client = sales_df['client']
print(type(client))
client.head()

<class 'pandas.core.series.Series'>


code
Sale-1117-HdZH          Gary Meza
Sale-5078-hqkc       Carol Martin
Sale-8209-xGVn     Jeremy Spencer
Sale-9093-bfcp    Pamela Anderson
Sale-7567-cCLb      William Smith
Name: client, dtype: object

In [85]:
# query a dataframe by combining two boolean conditions with &
sales_df[(sales_df['total_price'] > 733.1) & (sales_df['date'] > '2023-12-31')]


| code | client | total_price | date | hour | credit_card | full_date_formatted |
|---|---|---|---|---|---|---|
| Sale-1117-HdZH | Gary Meza | 832.48 | 2024-06-03 | 08:13:18 | 6011811575065598 | 2024-Jun-03 08:13:18 |
| Sale-0614-sxLF | Don Hoffman | 962.64 | 2024-01-03 | 23:55:55 | 5479577821251756 | 2024-Jan-03 23:55:55 |
| Sale-2906-BTeQ | Erin Hodge | 998.86 | 2024-01-08 | 06:56:38 | 4567559715138618 | 2024-Jan-08 06:56:38 |
| Sale-1627-BccC | Walter Jimenez | 912.7 | 2024-02-25 | 11:16:18 | 6540696554624030 | 2024-Feb-25 11:16:18 |
| Sale-6265-XNWg | Douglas Valencia | 927.28 | 2024-05-11 | 11:36:54 | 30569642128340 | 2024-May-11 11:36:54 |
| Sale-2093-FIOe | Kenneth Anderson | 767.85 | 2024-01-15 | 12:02:58 | 342591867309051 | 2024-Jan-15 12:02:58 |
| Sale-3056-LhkB | Lorraine Cline | 909.9 | 2024-04-25 | 00:18:15 | 502076502454 | 2024-Apr-25 00:18:15 |


In [90]:
# query a dataframe by row with loc
print(sales_df.loc['Sale-9093-bfcp'])
print('-'*80)
sales_df.loc['Sale-9093-bfcp', ['client', 'total_price', 'full_date_formatted']] #selecting which columns to show
#print(type(sales_df.loc['Sale-9093-bfcp']))


client                      Pamela Anderson
total_price                          166.57
date                             1973-12-24
hour                               21:28:34
credit_card                3558512811558836
full_date_formatted    1973-Dec-24 21:28:34
Name: Sale-9093-bfcp, dtype: object
--------------------------------------------------------------------------------


client                      Pamela Anderson
total_price                          166.57
full_date_formatted    1973-Dec-24 21:28:34
Name: Sale-9093-bfcp, dtype: object

In [98]:
# query a dataframe by row with iloc
sales_df['full_date'] = pd.to_datetime(sales_df['full_date_formatted'])
sales_date_ordered_df = sales_df.sort_values(by=['full_date', 'credit_card'], ascending=False)
sales_date_ordered_df.iloc[:3]
#sales_date_ordered_df.head()

| code | client | total_price | date | hour | credit_card | full_date_formatted | full_date |
|---|---|---|---|---|---|---|---|
| Sale-1117-HdZH | Gary Meza | 832.48 | 2024-06-03 | 08:13:18 | 6011811575065598 | 2024-Jun-03 08:13:18 | 2024-06-03 08:13:18 |
| Sale-6265-XNWg | Douglas Valencia | 927.28 | 2024-05-11 | 11:36:54 | 30569642128340 | 2024-May-11 11:36:54 | 2024-05-11 11:36:54 |
| Sale-3056-LhkB | Lorraine Cline | 909.9 | 2024-04-25 | 00:18:15 | 502076502454 | 2024-Apr-25 00:18:15 | 2024-04-25 00:18:15 |


In [99]:
# query a dataframe using a boolean mask
sales_df[sales_df['total_price'] > 800]

| code | client | total_price | date | hour | credit_card | full_date_formatted | full_date |
|---|---|---|---|---|---|---|---|
| Sale-1117-HdZH | Gary Meza | 832.48 | 2024-06-03 | 08:13:18 | 6011811575065598 | 2024-Jun-03 08:13:18 | 2024-06-03 08:13:18 |
| Sale-8209-xGVn | Jeremy Spencer | 832.05 | 1979-05-22 | 02:10:06 | 213185615148626 | 1979-May-22 02:10:06 | 1979-05-22 02:10:06 |
| Sale-7567-cCLb | William Smith | 857.13 | 2016-07-04 | 18:26:22 | 3541411469135312 | 2016-Jul-04 18:26:22 | 2016-07-04 18:26:22 |
| Sale-1256-GSGV | Travis Reid | 913.50 | 1987-01-01 | 12:19:45 | 30435866051594 | 1987-Jan-01 12:19:45 | 1987-01-01 12:19:45 |
| Sale-1938-dqto | Christopher Green DDS | 994.77 | 2023-09-19 | 12:54:28 | 4569498359626448 | 2023-Sep-19 12:54:28 | 2023-09-19 12:54:28 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| Sale-9251-Odat | Cheryl Daniels | 965.73 | 1983-07-17 | 07:20:13 | 4658428960463230 | 1983-Jul-17 07:20:13 | 1983-07-17 07:20:13 |
| Sale-9183-yPLQ | David Cook | 893.87 | 2016-03-01 | 02:52:29 | 370735642831427 | 2016-Mar-01 02:52:29 | 2016-03-01 02:52:29 |
| Sale-8638-viTW | Jose Craig | 905.32 | 2003-09-14 | 21:55:12 | 3509353286994309 | 2003-Sep-14 21:55:12 | 2003-09-14 21:55:12 |
| Sale-6471-qsQu | Daniel Wilson | 862.45 | 1976-10-30 | 01:56:35 | 4322622466883208 | 1976-Oct-30 01:56:35 | 1976-10-30 01:56:35 |


In [103]:
# query a dataframe using query()
# note: only the result of the last expression in the cell is displayed
sales_df.query('total_price > 800')
sales_df.query('total_price > 800 and client == "Jeremy Spencer"')

| code | client | total_price | date | hour | credit_card | full_date_formatted | full_date |
|---|---|---|---|---|---|---|---|
| Sale-8209-xGVn | Jeremy Spencer | 832.05 | 1979-05-22 | 02:10:06 | 213185615148626 | 1979-May-22 02:10:06 | 1979-05-22 02:10:06 |


In [104]:
# load the call center data and inspect it before looking for missing values
call_center_df = pd.read_json('json-files/call_center_comments.json')
print(call_center_df.describe())
print('-'*80)
print(call_center_df.info())
print('-'*80)
call_center_df.head()


                           date_time  attention_time
count                         980227   949911.000000
mean   2022-03-30 18:32:03.527999744       77.464617
min              2020-01-01 00:06:19        5.000052
25%              2021-02-14 10:54:23       41.203544
50%              2022-03-30 12:07:02       77.506505
75%              2023-05-14 14:36:18      113.708504
max              2024-06-27 11:47:56      149.999933
std                              NaN       41.843117
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column             Non-Null Count    Dtype         
---  ------             --------------    -----         
 0   code               1000000 non-null  object        
 1   client             1000000 non-null  object        
 2   product            1000000 non-null  object        
 3   date_time          980227 non-null   date

| | code | client | product | date_time | attention_time | comment | country_of_origin | city |
|---|---|---|---|---|---|---|---|---|
| 0 | JLz-1254574 | Terri Valentine | USB Flash Drive | 2021-12-27 21:56:53 | 41.675933 | Write speeds could be faster. | France | Paris |
| 1 | YsR-3166466 | Jessica Powell | Tablet | 2024-06-27 10:11:33 | 72.571054 | Great for reading and streaming videos. | India | Delhi |
| 2 | QeH-0295056 | Dana Hensley | Portable Projector | 2022-02-06 21:59:15 | 8.573219 | Battery life could be longer. | Australia | Brisbane |
| 3 | aUd-2224033 | Amy Kent | External Hard Drive | 2020-10-18 05:14:15 | 81.441193 | Transfer speeds are fast and reliable. | USA | Los Angeles |
| 4 | nvp-6413002 | Andrea Jones | Portable Projector | 2022-10-01 03:44:58 | 106.643005 | Battery life could be longer. | Canada | Calgary |


In [None]:
#get missing values using isnull()
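
One possible way to fill in this cell, shown on a small hypothetical frame rather than the real `call_center_df` (the values below are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in two columns
calls_sample = pd.DataFrame({
    'client': ['Ana', 'Bob', 'Claire', 'Dan'],
    'attention_time': [41.7, np.nan, 8.6, np.nan],
    'city': ['Paris', 'Delhi', None, 'Calgary'],
})

# isnull() returns a boolean DataFrame; summing it counts missing values per column
missing_per_column = calls_sample.isnull().sum()
print(missing_per_column)

# Keep only the rows that have at least one missing value
rows_with_missing = calls_sample[calls_sample.isnull().any(axis=1)]
print(rows_with_missing)
```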


In [None]:
# fill missing values using fillna()
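
A minimal sketch of `fillna()` on hypothetical data. Filling numbers with the column mean and text with `'Unknown'` is only one of several reasonable strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one numeric and one text column
calls_sample = pd.DataFrame({
    'attention_time': [40.0, np.nan, 80.0, np.nan],
    'city': ['Paris', None, 'Delhi', 'Calgary'],
})

# A dict maps each column to its own fill value
filled_sample = calls_sample.fillna({
    'attention_time': calls_sample['attention_time'].mean(),
    'city': 'Unknown',
})
print(filled_sample)
```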


In [None]:
# drop missing values using dropna()
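
A short sketch of `dropna()`, again on invented data. The `subset` parameter limits the check to the listed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps
calls_sample = pd.DataFrame({
    'client': ['Ana', 'Bob', 'Claire'],
    'attention_time': [41.7, np.nan, 8.6],
    'city': ['Paris', 'Delhi', None],
})

# Drop every row that contains at least one missing value
complete_rows = calls_sample.dropna()

# Drop a row only when attention_time is missing
with_time = calls_sample.dropna(subset=['attention_time'])

print(complete_rows)
print(with_time)
```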


In [None]:
# transform column using to_datetime()


In [None]:
# transform column using to_numeric()
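
A sketch of `pd.to_numeric()` on a hypothetical text column. With `errors='coerce'`, entries that cannot be parsed become `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical prices stored as text, with one unparseable entry
raw_prices = pd.Series(['832.48', '156.29', 'n/a', '498.43'])

# Convert to numbers; 'n/a' becomes NaN
numeric_prices = pd.to_numeric(raw_prices, errors='coerce')
print(numeric_prices)
```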


In [None]:
# convert column to category using astype()
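
A sketch of the categorical conversion on a hypothetical column of country names. Internally, a categorical column stores each distinct value once and keeps integer codes that refer to it.

```python
import pandas as pd

# Hypothetical text column with repeated values
countries = pd.Series(['France', 'India', 'France', 'USA', 'India', 'France'])

# Convert to the categorical dtype
countries_cat = countries.astype('category')

print(countries_cat.dtype)
print(list(countries_cat.cat.categories))  # the distinct values
print(countries_cat.cat.codes.tolist())    # integer code of each element
```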


In [None]:
# merge dataframes using merge()
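
A minimal sketch of `merge()` with two hypothetical tables that share a `name` column. The `how` parameter controls which keys are kept.

```python
import pandas as pd

# Hypothetical tables sharing the key column 'name'
careers = pd.DataFrame({
    'name': ['Ana', 'Bob', 'Claire'],
    'career': ['Physics', 'Biology', 'Computer Science'],
})
salaries = pd.DataFrame({
    'name': ['Ana', 'Claire', 'Dan'],
    'salary': [50000, 62000, 48000],
})

# Inner merge: only names present in both tables
inner_merged = careers.merge(salaries, on='name', how='inner')

# Left merge: every row of careers; missing salaries become NaN
left_merged = careers.merge(salaries, on='name', how='left')

print(inner_merged)
print(left_merged)
```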


In [None]:
# concatenate dataframes using concat()
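
A sketch of `pd.concat()` stacking two hypothetical frames that have the same columns:

```python
import pandas as pd

# Hypothetical batches of sales with identical columns
batch_1 = pd.DataFrame({'client': ['Ana', 'Bob'], 'total_price': [100.0, 250.0]})
batch_2 = pd.DataFrame({'client': ['Claire'], 'total_price': [75.5]})

# Stack the rows; ignore_index=True rebuilds a clean 0..n-1 index
all_batches = pd.concat([batch_1, batch_2], ignore_index=True)
print(all_batches)
```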


In [None]:
# join dataframes using join()


In [None]:
# group dataframes using groupby()
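
A minimal `groupby()` sketch on hypothetical sales data: rows are split by `client`, and `total_price` is summed within each group.

```python
import pandas as pd

# Hypothetical sales
sales_sample = pd.DataFrame({
    'client': ['Ana', 'Bob', 'Ana', 'Bob', 'Claire'],
    'total_price': [100.0, 200.0, 50.0, 300.0, 80.0],
})

# One total per client; the group keys become the index of the result
total_by_client = sales_sample.groupby('client')['total_price'].sum()
print(total_by_client)
```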


In [None]:
# group and aggregate dataframes using groupby() and aggregate()
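
A sketch of several aggregations computed at once using named aggregation, where each keyword becomes an output column. The data and the output column names are hypothetical.

```python
import pandas as pd

# Hypothetical sales
sales_sample = pd.DataFrame({
    'client': ['Ana', 'Bob', 'Ana', 'Bob', 'Claire'],
    'total_price': [100.0, 200.0, 50.0, 300.0, 80.0],
})

# Each keyword is output_name=(input_column, aggregation)
client_summary = sales_sample.groupby('client').agg(
    n_sales=('total_price', 'count'),
    total=('total_price', 'sum'),
    average=('total_price', 'mean'),
)
print(client_summary)
```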


In [None]:
# group and transform dataframes using groupby() and transform()
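
Unlike an aggregation, `transform()` returns a result with the same length as the input, so it can be assigned back as a new column. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical sales
sales_sample = pd.DataFrame({
    'client': ['Ana', 'Bob', 'Ana', 'Bob', 'Claire'],
    'total_price': [100.0, 200.0, 50.0, 300.0, 80.0],
})

# Broadcast each client's mean back onto that client's rows
sales_sample['client_mean'] = (
    sales_sample.groupby('client')['total_price'].transform('mean')
)

# Deviation of each sale from its client's mean
sales_sample['diff_from_mean'] = (
    sales_sample['total_price'] - sales_sample['client_mean']
)
print(sales_sample)
```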


In [None]:
# group and filter dataframes using groupby() and filter()
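
`filter()` keeps or discards whole groups according to a condition evaluated on each group. A sketch on hypothetical data that keeps only clients with at least two sales:

```python
import pandas as pd

# Hypothetical sales
sales_sample = pd.DataFrame({
    'client': ['Ana', 'Bob', 'Ana', 'Bob', 'Claire'],
    'total_price': [100.0, 200.0, 50.0, 300.0, 80.0],
})

# The function receives each group as a DataFrame and returns True to keep it
repeat_clients = sales_sample.groupby('client').filter(lambda g: len(g) >= 2)
print(repeat_clients)
```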


In [None]:
# reshape dataframes using pivot()


In [None]:
# pivot dataframes using pivot_table()
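
A sketch of `pivot_table()` on hypothetical data. Unlike `pivot()`, it aggregates rows that fall into the same cell, here with the mean.

```python
import pandas as pd

# Hypothetical call records
calls_sample = pd.DataFrame({
    'country': ['France', 'France', 'India', 'India', 'India'],
    'product': ['Tablet', 'Tablet', 'Tablet', 'Projector', 'Projector'],
    'attention_time': [40.0, 60.0, 30.0, 10.0, 20.0],
})

# Rows: countries; columns: products; cells: mean attention time
mean_time = calls_sample.pivot_table(
    index='country',
    columns='product',
    values='attention_time',
    aggfunc='mean',
)
print(mean_time)
```

Combinations with no rows, such as France and Projector in this sample, appear as `NaN`.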


### Advanced Transformations

In [None]:
# making transformations using apply()
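
A sketch of `apply()` used element-wise on a Series and row-wise on a DataFrame. The helper `price_band` and its thresholds are hypothetical.

```python
import pandas as pd

# Hypothetical sales
sales_sample = pd.DataFrame({
    'client': ['Ana', 'Bob', 'Claire'],
    'total_price': [100.0, 480.0, 900.0],
})

def price_band(price):
    """Hypothetical helper: label a sale by its price."""
    if price < 250:
        return 'low'
    if price < 750:
        return 'medium'
    return 'high'

# Element-wise: the function receives one value at a time
sales_sample['band'] = sales_sample['total_price'].apply(price_band)

# Row-wise: axis=1 passes each row to the function as a Series
sales_sample['label'] = sales_sample.apply(
    lambda row: f"{row['client']}: {row['band']}", axis=1
)
print(sales_sample)
```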


In [None]:
# making transformations using chain transformations
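
A sketch of method chaining, in which each step returns a new object that feeds the next one. The data and the 10% tax rate are hypothetical.

```python
import pandas as pd

# Hypothetical sales
sales_sample = pd.DataFrame({
    'client': ['Ana', 'Bob', 'Ana', 'Claire', 'Bob'],
    'total_price': [100.0, 900.0, 850.0, 50.0, 820.0],
})

large_sales_summary = (
    sales_sample
    .query('total_price > 800')                                # keep large sales
    .assign(price_with_tax=lambda d: d['total_price'] * 1.1)   # hypothetical 10% tax
    .groupby('client', as_index=False)['price_with_tax'].sum() # total per client
    .sort_values('price_with_tax', ascending=False)            # largest first
    .reset_index(drop=True)
)
print(large_sales_summary)
```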


### Statistical Testing

In [None]:
# making a t-test with pandas and scipy


In [None]:
# making an ANOVA test with pandas and scipy


In [None]:
# making a chi-square test with pandas and scipy


In [None]:
# making a correlation validation with pandas 
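
A sketch of a correlation check using only pandas. The data are invented so that the relationships are exactly linear. `DataFrame.corr()` computes Pearson correlation by default.

```python
import pandas as pd

# Hypothetical, perfectly linear data
study_sample = pd.DataFrame({
    'hours': [1, 2, 3, 4, 5],
    'score': [2, 4, 6, 8, 10],
    'errors': [5, 4, 3, 2, 1],
})

# Pairwise Pearson correlation between all numeric columns
corr_matrix = study_sample.corr()
print(corr_matrix)

# Correlation between two specific columns
hours_vs_errors = study_sample['hours'].corr(study_sample['errors'])
print(hours_vs_errors)
```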


In [None]:
# p-hacking example


In [None]:
# p-hacking example with multiple testing


In [None]:
# p-value example


In [None]:
# p-value correction with Bonferroni
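
A sketch of the Bonferroni correction using only pandas: each p-value is multiplied by the number of tests and capped at 1. The p-values below are invented, and `alpha = 0.05` is an assumed significance level.

```python
import pandas as pd

# Hypothetical p-values from four independent tests
p_values = pd.Series(
    [0.001, 0.02, 0.04, 0.30],
    index=['test_1', 'test_2', 'test_3', 'test_4'],
)
alpha = 0.05          # assumed significance level
n_tests = len(p_values)

# Bonferroni: multiply by the number of tests, cap at 1
adjusted = (p_values * n_tests).clip(upper=1.0)

# A test remains significant only if its adjusted p-value is below alpha
significant = adjusted < alpha

print(adjusted)
print(significant)
```

In this sample, `test_2` and `test_3` fall below 0.05 before the correction but not after it.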
