# PANDAS

__Pandas__ is a powerful _open-source_ data manipulation and analysis library for _Python_. 

It provides data structures and functions for efficiently __handling and analyzing structured data__, such as tables or spreadsheets.

With __pandas__, you can _load_, _manipulate_, and _analyze_ data, perform _data cleaning_ and _preprocessing_ tasks, and create _visualizations_.

It is widely used in _data science_, _machine learning_, and _data analysis_ projects.

To import the pandas library under the conventional alias `pd`, use `import pandas as pd`.

## The Series Data Structure

A __pandas Series__ is a _one-dimensional labeled array_ capable of holding any data type. It is similar to a _column_ in a spreadsheet or a SQL table, and it also behaves like a _dictionary-like_ object. It is a fundamental _data structure_ in the __pandas__ library.

A __pandas Series__ consists of two main components: the _data_ and the _index_. The _data_ can be of any type, such as integers, floats, strings, or even complex objects. The _index_ is a sequence of labels that identifies each element in the Series; labels are usually unique, although pandas does not require it.

Some key features of pandas Series include:
- Vectorized operations: Series supports vectorized operations, allowing you to perform element-wise computations efficiently.
- Label-based indexing: You can access elements in a Series by label, in addition to integer position.
- Alignment: Series automatically aligns data based on the index, making it easy to perform operations on multiple Series with different indexes.
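
The three features above can be illustrated with a minimal sketch. The Series names and values here (`prices`, `taxes`) are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Two small Series with partially overlapping labels (illustrative data)
prices = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])
taxes = pd.Series([1.0, 2.0, 3.0], index=["b", "c", "d"])

# Vectorized operation: every element is doubled without an explicit loop
doubled = prices * 2

# Label-based indexing: look up a value by its label
price_b = prices["b"]

# Alignment: values are matched by label; a label present in only one Series gives NaN
total = prices + taxes

print(doubled)
print(price_b)
print(total)
```

Note that the sum contains all four labels, with missing values for `a` and `d`, because each of those labels exists in only one of the two Series.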

To create a __Series__, you can pass a list, array, or dictionary-like object to the `pd.Series()` constructor. You can also specify custom index labels if needed.

In [2]:
import pandas as pd

# Create a Series object from a list of strings
list_strings = ['a', 'b', 'c', 'd', 'e']
serie_1 = pd.Series(list_strings)
print("Serie 1:", serie_1)

# Create a Series object from a list of numbers
list_numbers = [1, 2, 3, 4, 5]
serie_2 = pd.Series(list_numbers)
print("Serie 2:", serie_2)

# Create a Series object from a list of numbers with a None value
list_numbers_with_none = [1, 2, None, 4, 5]
serie_3 = pd.Series(list_numbers_with_none)
print("Serie 3:", serie_3)

Serie 1: 0    a
1    b
2    c
3    d
4    e
dtype: object
Serie 2: 0    1
1    2
2    3
3    4
4    5
dtype: int64
Serie 3: 0    1.0
1    2.0
2    NaN
3    4.0
4    5.0
dtype: float64
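
Serie 3 above became `float64` because `None` is stored as `NaN`, which is a floating-point value. If integer values need to be preserved, one option is pandas' nullable integer dtype. This is a small sketch with assumed variable names, not something the notebook itself uses.

```python
import pandas as pd

values = [1, 2, None, 4, 5]

# Default behaviour: None becomes NaN and the dtype is upcast to float64
as_float = pd.Series(values)

# Nullable integer dtype: the gap is stored as pd.NA and the values stay integers
as_nullable_int = pd.Series(values, dtype="Int64")

print(as_float.dtype)
print(as_nullable_int.dtype)
print(as_nullable_int.isna().sum())
```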


In [3]:
# Create a Series object from a dictionary
dict_data = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
serie_4 = pd.Series(dict_data)
print("Serie 4:", serie_4)

# Get the values of the Series index
print("Serie 4 index:", serie_4.index)

Serie 4: a    1
b    2
c    3
d    4
e    5
dtype: int64
Serie 4 index: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')


In [4]:
# Create a series object from a list of tuple pairs
list_tuples = [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)]
serie_5 = pd.Series(list_tuples)
print("Serie 5:", serie_5)

Serie 5: 0    (a, 1)
1    (b, 2)
2    (c, 3)
3    (d, 4)
4    (e, 5)
dtype: object
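
As the output shows, a list of tuples is stored as one tuple per element rather than being split into labels and values. If the intent is to use the first item of each pair as the label, one possible approach (a sketch with an assumed variable name) is to convert the pairs to a dictionary first.

```python
import pandas as pd

list_tuples = [('a', 1), ('b', 2), ('c', 3)]

# Converting the pairs to a dict makes the first item of each pair an index label
serie_from_pairs = pd.Series(dict(list_tuples))
print(serie_from_pairs)
```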


In [5]:
# Create a series object from a list as values and a list as index
list_index = ['a', 'b', 'c', 'd', 'e']
list_values = [1, 2, 3, 4, 5]
serie_6 = pd.Series(list_values, index=list_index)
print("Serie 6:\n", serie_6)

for index, value in serie_6.items():
    print(f"Index: {index}, Value: {value}")

Serie 6:
 a    1
b    2
c    3
d    4
e    5
dtype: int64
Index: a, Value: 1
Index: b, Value: 2
Index: c, Value: 3
Index: d, Value: 4
Index: e, Value: 5


In [6]:
# Query a Series object by boolean indexing
print("Serie 6 > 2:", serie_6[serie_6 > 2])
print("-"*10)
print(serie_6 > 2)

Serie 6 > 2: c    3
d    4
e    5
dtype: int64
----------
a    False
b    False
c     True
d     True
e     True
dtype: bool


In [7]:
# Query a Series object by fancy indexing (passing a list of labels)
print("Serie 6[['a', 'b']]:", serie_6[['a', 'b']])

Serie 6[['a', 'b']]: a    1
b    2
dtype: int64


In [8]:
# Query a Series object using loc[]
print("Serie 6.loc[['a', 'b']]:", serie_6.loc[['a', 'b']])

Serie 6.loc[['a', 'b']]: a    1
b    2
dtype: int64


In [9]:
# Query a Series object using iloc[]
print("Serie 6.iloc[0:3]:", serie_6.iloc[0:3])

Serie 6.iloc[0:3]: a    1
b    2
c    3
dtype: int64
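
One difference between `loc[]` and `iloc[]` that is easy to miss is how slices treat their end point. The sketch below uses a small stand-alone Series with assumed names to show it.

```python
import pandas as pd

serie = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Label slice with loc: the end label 'c' is included
by_label = serie.loc['a':'c']

# Position slice with iloc: the end position 2 is excluded
by_position = serie.iloc[0:2]

print(by_label)
print(by_position)
```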


## The DataFrame Data Structure

A __pandas DataFrame__ is a _two-dimensional_, _labeled_ data structure in _Python_ that is commonly used for _data manipulation and analysis_. It consists of _rows_ and _columns_, similar to a table in a relational database.

The __DataFrame__ can store _heterogeneous data types_ and provides various operations and functions to perform data manipulation, filtering, grouping, and statistical analysis.
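
As a rough illustration of heterogeneous columns, filtering, and grouping, consider the following sketch. The table `df_people` and its contents are hypothetical and only serve the example.

```python
import pandas as pd

# Hypothetical data: each column has its own dtype
df_people = pd.DataFrame({
    "name": ["Ana", "Luis", "Marta", "Pedro"],
    "team": ["red", "blue", "red", "blue"],
    "age": [31, 25, 42, 38],
    "score": [88.5, 92.0, 79.5, 85.0],
})

# One dtype per column
print(df_people.dtypes)

# Filtering: keep rows where age is above 30
older = df_people[df_people["age"] > 30]

# Grouping and statistics: mean score per team
mean_by_team = df_people.groupby("team")["score"].mean()

print(older)
print(mean_by_team)
```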

To access and manipulate the data in the __DataFrame__, you can use various _methods_ and _attributes_ provided by the __pandas__ library.

For more information on __pandas DataFrame__, refer to the [official pandas documentation](https://pandas.pydata.org/docs/reference/frame.html).

In [10]:
# create dataframes from lists
list_example = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
df = pd.DataFrame(list_example)
print("Dataframe from list:", df)
print("Dataframe from list shape:", df.shape)
print("Dataframe from list columns:", df.columns)
print("Dataframe from list index:", df.index)

Dataframe from list:    0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
Dataframe from list shape: (3, 3)
Dataframe from list columns: RangeIndex(start=0, stop=3, step=1)
Dataframe from list index: RangeIndex(start=0, stop=3, step=1)


In [11]:
# create a dataframe from a list of dictionaries
list_dict = [{'a': 1, 'b': 2, 'c': 3}, {'a': 4, 'b': 5, 'c': 6}, {'a': 7, 'b': 8, 'c': 9}]
df = pd.DataFrame(list_dict)
print("Dataframe from list of dictionaries:", df)
print("Dataframe from list columns:", df.columns)
print("Dataframe from list index:", df.index)

Dataframe from list of dictionaries:    a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
Dataframe from list columns: Index(['a', 'b', 'c'], dtype='object')
Dataframe from list index: RangeIndex(start=0, stop=3, step=1)


In [12]:
# create a dataframe from a csv file
df_csv = pd.read_csv('csv-files/StudentsInfo.csv')
print(df_csv.head())


            Name                         Company                     Position  \
0  Alice Johnson  Hernandez, Griffith and Nelson           Petroleum engineer   
1    David Jones                    Gomez-Garcia       Geologist, engineering   
2      Eva Brown                     Blevins LLC               Microbiologist   
3    Frank Davis                   Greene-Wilson     Museum education officer   
4  Jack Anderson                      Butler PLC  Scientist, research (maths)   

   Salary  
0    4740  
1   73329  
2   83245  
3   74390  
4   69851  


In [13]:
# create a dataframe from a json file
df_json = pd.read_json('json-files/StudentsInfo.json')
print(df_json.head())

   id            name                  career                college
0   1   Alice Johnson        Computer Science        Tech University
1   2       Bob Smith  Mechanical Engineering  Engineering Institute
2   3  Carol Williams  Electrical Engineering        Tech University
3   4     David Jones                 Biology        Science College
4   5       Eva Brown                 Physics        Tech University


In [14]:
# describe a dataframe
print("Dataframe csv describe:\n", df_csv.describe())
print("*"*10)
print("Dataframe json describe:\n", df_json.describe())

Dataframe csv describe:
              Salary
count     30.000000
mean   53491.600000
std    27432.862783
min     4740.000000
25%    27472.000000
50%    64104.000000
75%    72981.250000
max    93995.000000
**********
Dataframe json describe:
              id
count  50.00000
mean   25.50000
std    14.57738
min     1.00000
25%    13.25000
50%    25.50000
75%    37.75000
max    50.00000


In [15]:
# get information about a dataframe
# info() prints its report directly and returns None, so each print() below also shows "None"
print("Dataframe csv info:\n", df_csv.info())
print("="*50)
print("Dataframe json info:\n", df_json.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Name      30 non-null     object
 1   Company   30 non-null     object
 2   Position  30 non-null     object
 3   Salary    30 non-null     int64 
dtypes: int64(1), object(3)
memory usage: 1.1+ KB
Dataframe csv info:
 None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       50 non-null     int64 
 1   name     50 non-null     object
 2   career   50 non-null     object
 3   college  50 non-null     object
dtypes: int64(1), object(3)
memory usage: 1.7+ KB
Dataframe json info:
 None
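
The `None` in the output above appears because `info()` writes its report to standard output and returns nothing. If the report is needed as a string, for example to log it, a possible approach is to pass a text buffer through the `buf` parameter. The names `df_demo`, `buffer`, and `report` are assumptions made for this sketch.

```python
import io

import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# info() writes to stdout by default and returns None
result = df_demo.info()

# Passing a text buffer captures the report as a string instead
buffer = io.StringIO()
df_demo.info(buf=buffer)
report = buffer.getvalue()

print(result)
print(report)
```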


In [16]:
# indexes and columns
df_changed_index = df_csv.set_index('Name')
print("Dataframe csv changed index:\n", df_changed_index.head())
print("*"*50)
print("Dataframe csv changed index columns:\n", df_changed_index.columns)
print("Dataframe csv changed index index:\n", df_changed_index.index)

Dataframe csv changed index:
                                       Company                     Position  \
Name                                                                         
Alice Johnson  Hernandez, Griffith and Nelson           Petroleum engineer   
David Jones                      Gomez-Garcia       Geologist, engineering   
Eva Brown                         Blevins LLC               Microbiologist   
Frank Davis                     Greene-Wilson     Museum education officer   
Jack Anderson                      Butler PLC  Scientist, research (maths)   

               Salary  
Name                   
Alice Johnson    4740  
David Jones     73329  
Eva Brown       83245  
Frank Davis     74390  
Jack Anderson   69851  
**************************************************
Dataframe csv changed index columns:
 Index(['Company', 'Position', 'Salary'], dtype='object')
Dataframe csv changed index index:
 Index(['Alice Johnson', 'David Jones', 'Eva Brown', 'Frank Davis',
      

In [17]:
# rename columns
df_renamed = df_csv.rename(columns={'Salary': 'Salary (USD/year)'})
df_renamed.head()

| | Name | Company | Position | Salary (USD/year) |
|---|---|---|---|---|
| 0 | Alice Johnson | Hernandez, Griffith and Nelson | Petroleum engineer | 4740 |
| 1 | David Jones | Gomez-Garcia | Geologist, engineering | 73329 |
| 2 | Eva Brown | Blevins LLC | Microbiologist | 83245 |
| 3 | Frank Davis | Greene-Wilson | Museum education officer | 74390 |
| 4 | Jack Anderson | Butler PLC | Scientist, research (maths) | 69851 |
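
Both `rename()` and `set_index()` return new DataFrames and leave the original untouched, which is why the cells above assign the result to a new variable. The following sketch, using a small made-up table, shows that behaviour and also that `rename()` accepts a function instead of a dictionary.

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": ["Ana", "Luis"], "Salary": [50000, 62000]})

# rename() returns a new DataFrame; here a function is applied to every column name
df_lower = df_demo.rename(columns=str.lower)

# set_index() followed by reset_index() brings the column back
df_round_trip = df_demo.set_index("Name").reset_index()

print(df_demo.columns.tolist())
print(df_lower.columns.tolist())
print(df_round_trip)
```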


## Using Datetime in Pandas

In [18]:
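# converting a column to datetime with to_datetime()
sales_df = pd.read_json('json-files/sales_data.json')
# info() prints its report and returns None, so each print() around it also shows "None"
print("Sales dataframe:\n", sales_df.info())
# show the row whose date uses a non-standard month abbreviation ('Dic')
print(sales_df[sales_df['code'] == "Sale-4594-TuGI"])
# replace that value so to_datetime() can parse the whole column
sales_df = sales_df.replace('2008-Dic-23', '2008-12-23')
print("*"*50)
sales_df['date'] = pd.to_datetime(sales_df['date'])
print("Sales dataframe with Date column as datetime:\n", sales_df.info())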


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   code         1000 non-null   object 
 1   client       1000 non-null   object 
 2   total_price  1000 non-null   float64
 3   date         1000 non-null   object 
 4   hour         1000 non-null   object 
 5   credit_card  1000 non-null   int64  
dtypes: float64(1), int64(1), object(4)
memory usage: 47.0+ KB
Sales dataframe:
 None
              code         client  total_price         date      hour  \
24  Sale-4594-TuGI  Patrick Meyer       150.19  2008-Dic-23  07:32:54   

     credit_card  
24  569507527095  
**************************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   code         1000 non-null   
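
The cell above fixes one known bad value by hand. When the bad values are not known in advance, one possible way to find them is to parse with `errors="coerce"`, which produces `NaT` for anything that does not match. The sample values below are hypothetical and only mimic the kind of entry seen in the sales data.

```python
import pandas as pd

# Hypothetical raw values; the second one uses a non-standard month abbreviation
raw_dates = pd.Series(["2008-12-22", "2008-Dic-23", "2008-12-24"])

# errors="coerce" turns unparseable values into NaT instead of raising an error
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# The NaT positions reveal which raw values need fixing
bad_values = raw_dates[parsed.isna()]

print(parsed)
print(bad_values)
```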

In [19]:
# converting a column from datetime to string with strftime()
sales_df['date_full'] = sales_df['date'].dt.strftime('%Y-%b-%d') + " " + sales_df['hour']
print(sales_df.head())

             code           client  total_price       date      hour  \
0  Sale-1117-HdZH        Gary Meza       832.48 2024-06-03  08:13:18   
1  Sale-5078-hqkc     Carol Martin       156.29 2014-08-22  02:57:20   
2  Sale-8209-xGVn   Jeremy Spencer       832.05 1979-05-22  02:10:06   
3  Sale-9093-bfcp  Pamela Anderson       166.57 1973-12-24  21:28:34   
4  Sale-8141-KGOb    Kenneth Marsh       498.43 1974-03-14  03:55:36   

        credit_card             date_full  
0  6011811575065598  2024-Jun-03 08:13:18  
1  6536303182814044  2014-Aug-22 02:57:20  
2   213185615148626  1979-May-22 02:10:06  
3  3558512811558836  1973-Dec-24 21:28:34  
4  2239583806605394  1974-Mar-14 03:55:36  


In [20]:
# converting a datetime column to a Unix timestamp (seconds since the epoch) with timestamp()
sales_df['timestamp'] = pd.to_datetime(sales_df['date_full']).apply(lambda x: x.timestamp())
print(sales_df.head())
# timestamp() returns a float; cast to int64 to store whole seconds
sales_df['timestamp'] = sales_df['timestamp'].astype('int64')
print("="*50)
print(sales_df.head())


             code           client  total_price       date      hour  \
0  Sale-1117-HdZH        Gary Meza       832.48 2024-06-03  08:13:18   
1  Sale-5078-hqkc     Carol Martin       156.29 2014-08-22  02:57:20   
2  Sale-8209-xGVn   Jeremy Spencer       832.05 1979-05-22  02:10:06   
3  Sale-9093-bfcp  Pamela Anderson       166.57 1973-12-24  21:28:34   
4  Sale-8141-KGOb    Kenneth Marsh       498.43 1974-03-14  03:55:36   

        credit_card             date_full     timestamp  
0  6011811575065598  2024-Jun-03 08:13:18  1.717402e+09  
1  6536303182814044  2014-Aug-22 02:57:20  1.408676e+09  
2   213185615148626  1979-May-22 02:10:06  2.961870e+08  
3  3558512811558836  1973-Dec-24 21:28:34  1.256165e+08  
4  2239583806605394  1974-Mar-14 03:55:36  1.324653e+08  
             code           client  total_price       date      hour  \
0  Sale-1117-HdZH        Gary Meza       832.48 2024-06-03  08:13:18   
1  Sale-5078-hqkc     Carol Martin       156.29 2014-08-22  02:57:20   
2  
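
The cell above calls `timestamp()` once per row through `apply()`. A vectorized alternative, sketched here under the assumption that the datetimes are timezone-naive, is to subtract the Unix epoch and divide by one second. The sample `dates` Series is made up for the example.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["1970-01-01 00:00:10", "2024-06-03 08:13:18"]))

# Subtract the Unix epoch and floor-divide by one second; no Python-level loop is needed
epoch_seconds = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

print(epoch_seconds)
```

The result is already an integer Series, so no extra `astype('int64')` step is required.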

In [21]:
# summary statistics for the numeric and datetime columns
print(sales_df.describe())

       total_price                        date   credit_card     timestamp
count  1000.000000                        1000  1.000000e+03  1.000000e+03
mean    489.713800  1996-04-08 06:48:57.600000  3.550223e+17  8.289886e+08
min      11.850000         1970-01-09 00:00:00  6.044392e+10  7.680400e+05
25%     226.677500         1982-09-14 06:00:00  1.800112e+14  4.008824e+08
50%     486.430000         1995-11-17 12:00:00  3.520377e+15  8.166286e+08
75%     732.752500         2009-05-24 06:00:00  4.632459e+15  1.243203e+09
max     999.710000         2024-06-03 00:00:00  4.997675e+18  1.717402e+09
std     288.791957                         NaN  1.215563e+18  4.951866e+08


In [22]:
%%timeit -n 100
import numpy as np

# Sum the column with a single vectorized NumPy call; the print runs on every timed loop
print(np.round(np.sum(sales_df['total_price']), 3))

489713.8

In [23]:
%%timeit -n 100
import numpy as np
# Sum the column with an explicit Python loop, for comparison with the vectorized version
total = 0
for price in sales_df['total_price']:
    total += price
print(np.round(total, 3))

489713.8
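
The two timing cells print their result on every loop, which floods the output and hides the timing itself. Outside a notebook, a comparable measurement could be sketched with the standard `timeit` module, as below. The `prices` Series is a hypothetical stand-in for the `total_price` column, and the measured times will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for the price column
prices = pd.Series(np.linspace(10.0, 1000.0, 1000))


def vectorized_total():
    """Sum the Series with one vectorized call."""
    return prices.sum()


def loop_total():
    """Sum the Series with an explicit Python loop."""
    total = 0.0
    for price in prices:
        total += price
    return total


# timeit.timeit returns the total time in seconds for `number` calls
t_vectorized = timeit.timeit(vectorized_total, number=100)
t_loop = timeit.timeit(loop_total, number=100)

print(f"vectorized: {t_vectorized:.4f} s, loop: {t_loop:.4f} s")
```

Returning the value instead of printing it keeps the timed code free of I/O, so the comparison reflects the summation alone.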

## Queries and Transformations

In [24]:
# drop a column
sales_copy_df = sales_df.copy()
print(sales_copy_df.columns)
# drop() returns a new DataFrame; columns= names the columns to remove
sales_copy_df = sales_copy_df.drop(columns=['timestamp'])
sales_copy_df.head()

Index(['code', 'client', 'total_price', 'date', 'hour', 'credit_card',
       'date_full', 'timestamp'],
      dtype='object')


| | code | client | total_price | date | hour | credit_card | date_full |
|---|---|---|---|---|---|---|---|
| 0 | Sale-1117-HdZH | Gary Meza | 832.48 | 2024-06-03 | 08:13:18 | 6011811575065598 | 2024-Jun-03 08:13:18 |
| 1 | Sale-5078-hqkc | Carol Martin | 156.29 | 2014-08-22 | 02:57:20 | 6536303182814044 | 2014-Aug-22 02:57:20 |
| 2 | Sale-8209-xGVn | Jeremy Spencer | 832.05 | 1979-05-22 | 02:10:06 | 213185615148626 | 1979-May-22 02:10:06 |
| 3 | Sale-9093-bfcp | Pamela Anderson | 166.57 | 1973-12-24 | 21:28:34 | 3558512811558836 | 1973-Dec-24 21:28:34 |
| 4 | Sale-8141-KGOb | Kenneth Marsh | 498.43 | 1974-03-14 | 03:55:36 | 2239583806605394 | 1974-Mar-14 03:55:36 |


In [25]:

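# drop a row
# use the sale code as the index so rows can be dropped by label
sales_df = sales_df.set_index('code')
sales_copy_df = sales_df.copy()
# add a constant column to the copy; the original sales_df is unchanged
sales_copy_df['country'] = 'Colombia'
print(sales_copy_df.head())
# drop the row labeled 'Sale-8141-KGOb'
sales_copy_df = sales_copy_df.drop(index='Sale-8141-KGOb')
sales_copy_df.head()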

                         client  total_price       date      hour  \
code                                                                
Sale-1117-HdZH        Gary Meza       832.48 2024-06-03  08:13:18   
Sale-5078-hqkc     Carol Martin       156.29 2014-08-22  02:57:20   
Sale-8209-xGVn   Jeremy Spencer       832.05 1979-05-22  02:10:06   
Sale-9093-bfcp  Pamela Anderson       166.57 1973-12-24  21:28:34   
Sale-8141-KGOb    Kenneth Marsh       498.43 1974-03-14  03:55:36   

                     credit_card             date_full   timestamp   country  
code                                                                          
Sale-1117-HdZH  6011811575065598  2024-Jun-03 08:13:18  1717402398  Colombia  
Sale-5078-hqkc  6536303182814044  2014-Aug-22 02:57:20  1408676240  Colombia  
Sale-8209-xGVn   213185615148626  1979-May-22 02:10:06   296187006  Colombia  
Sale-9093-bfcp  3558512811558836  1973-Dec-24 21:28:34   125616514  Colombia  
Sale-8141-KGOb  2239583806605394  1974-Mar

| code | client | total_price | date | hour | credit_card | date_full | timestamp | country |
|---|---|---|---|---|---|---|---|---|
| Sale-1117-HdZH | Gary Meza | 832.48 | 2024-06-03 | 08:13:18 | 6011811575065598 | 2024-Jun-03 08:13:18 | 1717402398 | Colombia |
| Sale-5078-hqkc | Carol Martin | 156.29 | 2014-08-22 | 02:57:20 | 6536303182814044 | 2014-Aug-22 02:57:20 | 1408676240 | Colombia |
| Sale-8209-xGVn | Jeremy Spencer | 832.05 | 1979-05-22 | 02:10:06 | 213185615148626 | 1979-May-22 02:10:06 | 296187006 | Colombia |
| Sale-9093-bfcp | Pamela Anderson | 166.57 | 1973-12-24 | 21:28:34 | 3558512811558836 | 1973-Dec-24 21:28:34 | 125616514 | Colombia |
| Sale-7567-cCLb | William Smith | 857.13 | 2016-07-04 | 18:26:22 | 3541411469135312 | 2016-Jul-04 18:26:22 | 1467656782 | Colombia |


In [26]:
# query a dataframe by column
clients = sales_df['client']
print(type(clients))
print(clients.head())

<class 'pandas.core.series.Series'>
code
Sale-1117-HdZH          Gary Meza
Sale-5078-hqkc       Carol Martin
Sale-8209-xGVn     Jeremy Spencer
Sale-9093-bfcp    Pamela Anderson
Sale-8141-KGOb      Kenneth Marsh
Name: client, dtype: object


In [27]:
# query a dataframe by row with loc
sales_df.loc['Sale-8209-xGVn', ['client', 'date']]


client         Jeremy Spencer
date      1979-05-22 00:00:00
Name: Sale-8209-xGVn, dtype: object

In [28]:
# query a dataframe by row with iloc
sales_df.iloc[2:5]

| code | client | total_price | date | hour | credit_card | date_full | timestamp |
|---|---|---|---|---|---|---|---|
| Sale-8209-xGVn | Jeremy Spencer | 832.05 | 1979-05-22 | 02:10:06 | 213185615148626 | 1979-May-22 02:10:06 | 296187006 |
| Sale-9093-bfcp | Pamela Anderson | 166.57 | 1973-12-24 | 21:28:34 | 3558512811558836 | 1973-Dec-24 21:28:34 | 125616514 |
| Sale-8141-KGOb | Kenneth Marsh | 498.43 | 1974-03-14 | 03:55:36 | 2239583806605394 | 1974-Mar-14 03:55:36 | 132465336 |


In [29]:
# query a dataframe using a boolean mask
sales_df[sales_df['total_price'] > 600].head()

| code | client | total_price | date | hour | credit_card | date_full | timestamp |
|---|---|---|---|---|---|---|---|
| Sale-1117-HdZH | Gary Meza | 832.48 | 2024-06-03 | 08:13:18 | 6011811575065598 | 2024-Jun-03 08:13:18 | 1717402398 |
| Sale-8209-xGVn | Jeremy Spencer | 832.05 | 1979-05-22 | 02:10:06 | 213185615148626 | 1979-May-22 02:10:06 | 296187006 |
| Sale-7567-cCLb | William Smith | 857.13 | 2016-07-04 | 18:26:22 | 3541411469135312 | 2016-Jul-04 18:26:22 | 1467656782 |
| Sale-1256-GSGV | Travis Reid | 913.5 | 1987-01-01 | 12:19:45 | 30435866051594 | 1987-Jan-01 12:19:45 | 536501985 |
| Sale-2412-paxo | James Harris | 620.8 | 1970-09-29 | 09:31:03 | 4619378591392526336 | 1970-Sep-29 09:31:03 | 23448663 |


In [30]:
# query a dataframe combining two boolean masks with & (each condition in parentheses)
sales_df[(sales_df['total_price'] > 600) & (sales_df['date'] > '2016-01-01')].head()

| code | client | total_price | date | hour | credit_card | date_full | timestamp |
|---|---|---|---|---|---|---|---|
| Sale-1117-HdZH | Gary Meza | 832.48 | 2024-06-03 | 08:13:18 | 6011811575065598 | 2024-Jun-03 08:13:18 | 1717402398 |
| Sale-7567-cCLb | William Smith | 857.13 | 2016-07-04 | 18:26:22 | 3541411469135312 | 2016-Jul-04 18:26:22 | 1467656782 |
| Sale-1938-dqto | Christopher Green DDS | 994.77 | 2023-09-19 | 12:54:28 | 4569498359626448 | 2023-Sep-19 12:54:28 | 1695128068 |
| Sale-5358-yFiM | Gene Davis | 871.66 | 2023-06-18 | 18:04:39 | 36076121497968 | 2023-Jun-18 18:04:39 | 1687111479 |
| Sale-8546-Mwsg | William Payne | 915.85 | 2019-09-25 | 02:57:57 | 30108709853944 | 2019-Sep-25 02:57:57 | 1569380277 |
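
Boolean masks can also be stored in variables and combined with `|` (or) and `~` (not), not only `&` (and). The sketch below uses a small hypothetical table, `df_sales`, with column names that mirror the sales data.

```python
import pandas as pd

# Hypothetical rows that mimic the structure of the sales data
df_sales = pd.DataFrame({
    "client": ["Ana", "Luis", "Marta", "Pedro"],
    "total_price": [832.48, 156.29, 650.00, 498.43],
    "date": pd.to_datetime(["2024-06-03", "2014-08-22", "2015-03-01", "2020-01-15"]),
})

# Each mask is a boolean Series aligned with the rows of df_sales
expensive = df_sales["total_price"] > 600
recent = df_sales["date"] > "2016-01-01"

both = df_sales[expensive & recent]      # rows matching both conditions
either = df_sales[expensive | recent]    # rows matching at least one condition
not_expensive = df_sales[~expensive]     # rows where the first condition is False

print(both)
print(either)
print(not_expensive)
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the operators `&`, `|`, and `~` are used instead.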


In [31]:
# capitalize the first letter of every column name (the rest is lowercased)
sales_df.columns = sales_df.columns.str.capitalize()
sales_df.head()

| code | Client | Total_price | Date | Hour | Credit_card | Date_full | Timestamp |
|---|---|---|---|---|---|---|---|
| Sale-1117-HdZH | Gary Meza | 832.48 | 2024-06-03 | 08:13:18 | 6011811575065598 | 2024-Jun-03 08:13:18 | 1717402398 |
| Sale-5078-hqkc | Carol Martin | 156.29 | 2014-08-22 | 02:57:20 | 6536303182814044 | 2014-Aug-22 02:57:20 | 1408676240 |
| Sale-8209-xGVn | Jeremy Spencer | 832.05 | 1979-05-22 | 02:10:06 | 213185615148626 | 1979-May-22 02:10:06 | 296187006 |
| Sale-9093-bfcp | Pamela Anderson | 166.57 | 1973-12-24 | 21:28:34 | 3558512811558836 | 1973-Dec-24 21:28:34 | 125616514 |
| Sale-8141-KGOb | Kenneth Marsh | 498.43 | 1974-03-14 | 03:55:36 | 2239583806605394 | 1974-Mar-14 03:55:36 | 132465336 |


In [32]:
# move the 'code' index back into a regular column with reset_index()
print(sales_df.head())
sales_df = sales_df.reset_index()
sales_df.head()

                         Client  Total_price       Date      Hour  \
code                                                                
Sale-1117-HdZH        Gary Meza       832.48 2024-06-03  08:13:18   
Sale-5078-hqkc     Carol Martin       156.29 2014-08-22  02:57:20   
Sale-8209-xGVn   Jeremy Spencer       832.05 1979-05-22  02:10:06   
Sale-9093-bfcp  Pamela Anderson       166.57 1973-12-24  21:28:34   
Sale-8141-KGOb    Kenneth Marsh       498.43 1974-03-14  03:55:36   

                     Credit_card             Date_full   Timestamp  
code                                                                
Sale-1117-HdZH  6011811575065598  2024-Jun-03 08:13:18  1717402398  
Sale-5078-hqkc  6536303182814044  2014-Aug-22 02:57:20  1408676240  
Sale-8209-xGVn   213185615148626  1979-May-22 02:10:06   296187006  
Sale-9093-bfcp  3558512811558836  1973-Dec-24 21:28:34   125616514  
Sale-8141-KGOb  2239583806605394  1974-Mar-14 03:55:36   132465336  


| | code | Client | Total_price | Date | Hour | Credit_card | Date_full | Timestamp |
|---|---|---|---|---|---|---|---|---|
| 0 | Sale-1117-HdZH | Gary Meza | 832.48 | 2024-06-03 | 08:13:18 | 6011811575065598 | 2024-Jun-03 08:13:18 | 1717402398 |
| 1 | Sale-5078-hqkc | Carol Martin | 156.29 | 2014-08-22 | 02:57:20 | 6536303182814044 | 2014-Aug-22 02:57:20 | 1408676240 |
| 2 | Sale-8209-xGVn | Jeremy Spencer | 832.05 | 1979-05-22 | 02:10:06 | 213185615148626 | 1979-May-22 02:10:06 | 296187006 |
| 3 | Sale-9093-bfcp | Pamela Anderson | 166.57 | 1973-12-24 | 21:28:34 | 3558512811558836 | 1973-Dec-24 21:28:34 | 125616514 |
| 4 | Sale-8141-KGOb | Kenneth Marsh | 498.43 | 1974-03-14 | 03:55:36 | 2239583806605394 | 1974-Mar-14 03:55:36 | 132465336 |


In [33]:
# get the number of values and the unique values in a column
print(sales_df['Date'].size)
print(sales_df['Date'].unique())

1000
<DatetimeArray>
['2024-06-03 00:00:00', '2014-08-22 00:00:00', '1979-05-22 00:00:00',
 '1973-12-24 00:00:00', '1974-03-14 00:00:00', '2016-07-04 00:00:00',
 '2010-07-18 00:00:00', '1988-11-18 00:00:00', '1987-01-01 00:00:00',
 '2017-11-06 00:00:00',
 ...
 '2012-10-15 00:00:00', '1994-02-06 00:00:00', '2003-08-16 00:00:00',
 '2018-09-26 00:00:00', '2007-12-19 00:00:00', '2002-10-24 00:00:00',
 '1999-06-29 00:00:00', '1991-11-27 00:00:00', '1979-07-07 00:00:00',
 '1972-03-07 00:00:00']
Length: 981, dtype: datetime64[ns]
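
Alongside `unique()`, two related methods are often useful when exploring a column: `nunique()` for the number of distinct values and `value_counts()` for how often each one occurs. A minimal sketch on a made-up Series:

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "green", "red"])

distinct = colors.unique()        # distinct values, in order of first appearance
n_distinct = colors.nunique()     # number of distinct values
counts = colors.value_counts()    # occurrences of each value, most frequent first

print(distinct)
print(n_distinct)
print(counts)
```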


In [34]:
# query a dataframe using query()
sales_df.query('Total_price > 600').head()

| | code | Client | Total_price | Date | Hour | Credit_card | Date_full | Timestamp |
|---|---|---|---|---|---|---|---|---|
| 0 | Sale-1117-HdZH | Gary Meza | 832.48 | 2024-06-03 | 08:13:18 | 6011811575065598 | 2024-Jun-03 08:13:18 | 1717402398 |
| 2 | Sale-8209-xGVn | Jeremy Spencer | 832.05 | 1979-05-22 | 02:10:06 | 213185615148626 | 1979-May-22 02:10:06 | 296187006 |
| 5 | Sale-7567-cCLb | William Smith | 857.13 | 2016-07-04 | 18:26:22 | 3541411469135312 | 2016-Jul-04 18:26:22 | 1467656782 |
| 8 | Sale-1256-GSGV | Travis Reid | 913.5 | 1987-01-01 | 12:19:45 | 30435866051594 | 1987-Jan-01 12:19:45 | 536501985 |
| 12 | Sale-2412-paxo | James Harris | 620.8 | 1970-09-29 | 09:31:03 | 4619378591392526336 | 1970-Sep-29 09:31:03 | 23448663 |


In [35]:
# combine several conditions inside query() with and / or
sales_df.query('Total_price > 600 and Date > "2016-01-01"').head()

| | code | Client | Total_price | Date | Hour | Credit_card | Date_full | Timestamp |
|---|---|---|---|---|---|---|---|---|
| 0 | Sale-1117-HdZH | Gary Meza | 832.48 | 2024-06-03 | 08:13:18 | 6011811575065598 | 2024-Jun-03 08:13:18 | 1717402398 |
| 5 | Sale-7567-cCLb | William Smith | 857.13 | 2016-07-04 | 18:26:22 | 3541411469135312 | 2016-Jul-04 18:26:22 | 1467656782 |
| 18 | Sale-1938-dqto | Christopher Green DDS | 994.77 | 2023-09-19 | 12:54:28 | 4569498359626448 | 2023-Sep-19 12:54:28 | 1695128068 |
| 59 | Sale-5358-yFiM | Gene Davis | 871.66 | 2023-06-18 | 18:04:39 | 36076121497968 | 2023-Jun-18 18:04:39 | 1687111479 |
| 81 | Sale-8546-Mwsg | William Payne | 915.85 | 2019-09-25 | 02:57:57 | 30108709853944 | 2019-Sep-25 02:57:57 | 1569380277 |
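
The `query()` examples above hard-code the threshold inside the string. When the value lives in a Python variable, it can be referenced with the `@` prefix. The table `df_sales` and the variable `min_price` below are assumptions made for this sketch.

```python
import pandas as pd

df_sales = pd.DataFrame({
    "Client": ["Ana", "Luis", "Marta"],
    "Total_price": [832.48, 156.29, 650.00],
})

min_price = 600  # hypothetical threshold

# @min_price refers to the Python variable defined above
result = df_sales.query("Total_price > @min_price")

print(result)
```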


In [36]:
# get missing values using isnull()
call_center_df = pd.read_json('json-files/call_center_comments.json')
print(call_center_df.info())
print("*"*50)
print(call_center_df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column             Non-Null Count    Dtype         
---  ------             --------------    -----         
 0   code               1000000 non-null  object        
 1   client             1000000 non-null  object        
 2   product            1000000 non-null  object        
 3   date_time          979847 non-null   datetime64[ns]
 4   attention_time     950263 non-null   float64       
 5   comment            1000000 non-null  object        
 6   country_of_origin  1000000 non-null  object        
 7   city               858525 non-null   object        
dtypes: datetime64[ns](1), float64(1), object(6)
memory usage: 61.0+ MB
None
**************************************************
code                      0
client                    0
product                   0
date_time             20153
attention_time        49737
comment                   0
country_of_origin      

In [37]:
# fill missing values using fillna()
# fill missing attention_time with average time
import numpy as np
call_center_filled_df = call_center_df.copy()
call_center_filled_df['attention_time'] = call_center_filled_df['attention_time'].fillna(np.mean(call_center_df['attention_time']))
print(call_center_filled_df.isnull().sum())

# fill missing city with not reported
call_center_filled_df['city'] = call_center_filled_df['city'].fillna('Not reported')
print(call_center_filled_df.isnull().sum())

# fill missing values in the date_time column using interpolate()
call_center_filled_df['date_time'] = call_center_filled_df['date_time'].interpolate()
print(call_center_filled_df.isnull().sum())
print(call_center_filled_df.info())
call_center_filled_df.head()


code                      0
client                    0
product                   0
date_time             20153
attention_time            0
comment                   0
country_of_origin         0
city                 141475
dtype: int64
code                     0
client                   0
product                  0
date_time            20153
attention_time           0
comment                  0
country_of_origin        0
city                     0
dtype: int64
code                 0
client               0
product              0
date_time            0
attention_time       0
comment              0
country_of_origin    0
city                 0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column             Non-Null Count    Dtype         
---  ------             --------------    -----         
 0   code               1000000 non-null  object        
 1   client             1000000 non-null  object       

| | code | client | product | date_time | attention_time | comment | country_of_origin | city |
|---|---|---|---|---|---|---|---|---|
| 0 | eer-4490986 | Laurie Wallace | SSD | 2020-02-29 17:51:25 | 86.740331 | Significantly improved my computer's boot time. | Brazil | Belo Horizonte |
| 1 | qDL-3585949 | Craig Harris | Graphics Card | 2023-10-26 10:12:43 | 77.534184 | Runs cool and quiet under load. | USA | Not reported |
| 2 | sQo-4191674 | Cody Romero | VR Headset | 2022-08-07 02:52:48 | 104.754283 | Immersive gaming and media experience. | Russia | Kazan |
| 3 | QAy-6767272 | Renee Sanchez | Graphics Card | 2020-03-21 19:23:23 | 86.052689 | Pricey, but worth it for the performance. | India | Kolkata |
| 4 | lAI-7220154 | Anthony Collins | Smart Home Assistant | 2021-01-08 14:00:46 | 35.819705 | Concerns about privacy and data security. | Italy | Rome |
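
To see what each filling strategy does to individual values, it can help to try them on a few rows. The sketch below uses a tiny hypothetical table, `df_calls`, with the same column names as the call center data; it compares a mean fill, a linear interpolation, and a constant fill.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in both a numeric and a text column
df_calls = pd.DataFrame({
    "attention_time": [10.0, np.nan, 30.0, np.nan, 50.0],
    "city": ["Rome", None, "Kazan", "Kolkata", None],
})

# Mean of the observed values (missing values are skipped by default)
mean_time = df_calls["attention_time"].mean()

filled_mean = df_calls["attention_time"].fillna(mean_time)   # every gap gets the same value
filled_interp = df_calls["attention_time"].interpolate()     # each gap is filled from its neighbours
filled_city = df_calls["city"].fillna("Not reported")        # constant placeholder for text

print(filled_mean.tolist())
print(filled_interp.tolist())
print(filled_city.tolist())
```

Which strategy is appropriate depends on the data: interpolation assumes the row order is meaningful, while a mean fill does not.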


In [38]:
# drop missing values using dropna()
cleaned_df = call_center_df.dropna()
print(cleaned_df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 799390 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   code               799390 non-null  object        
 1   client             799390 non-null  object        
 2   product            799390 non-null  object        
 3   date_time          799390 non-null  datetime64[ns]
 4   attention_time     799390 non-null  float64       
 5   comment            799390 non-null  object        
 6   country_of_origin  799390 non-null  object        
 7   city               799390 non-null  object        
dtypes: datetime64[ns](1), float64(1), object(6)
memory usage: 54.9+ MB
None
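
Called without arguments, `dropna()` removes a row as soon as any column is missing, which is why about 200,000 rows disappear above. As a minimal sketch on a toy frame (the values are made up for illustration), the `subset` parameter limits that check to the columns you name:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in two different columns
toy_df = pd.DataFrame({
    "attention_time": [12.5, np.nan, 40.0, 88.1],
    "city": ["Rome", "Kazan", None, "Perth"],
})

# Default: a row is dropped if ANY column is missing
all_cols = toy_df.dropna()

# subset: only rows with a missing attention_time are dropped
only_time = toy_df.dropna(subset=["attention_time"])

print(len(all_cols), len(only_time))
```

Whether to drop on every column or only on the ones an analysis needs depends on the question being asked; the second form keeps more rows.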


In [39]:
# transform column using to_datetime()
call_center_df['dt_str'] = call_center_df['date_time'].astype(str)
print(call_center_df.info())
call_center_df['dt'] = pd.to_datetime(call_center_df['dt_str'])
print(call_center_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 9 columns):
 #   Column             Non-Null Count    Dtype         
---  ------             --------------    -----         
 0   code               1000000 non-null  object        
 1   client             1000000 non-null  object        
 2   product            1000000 non-null  object        
 3   date_time          979847 non-null   datetime64[ns]
 4   attention_time     950263 non-null   float64       
 5   comment            1000000 non-null  object        
 6   country_of_origin  1000000 non-null  object        
 7   city               858525 non-null   object        
 8   dt_str             1000000 non-null  object        
dtypes: datetime64[ns](1), float64(1), object(7)
memory usage: 68.7+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
 #   Column             Non-Null Count    Dtype         
---  ------     

In [40]:
# transform column using to_numeric()
call_center_df['at_str'] = call_center_df['attention_time'].astype(str)
print(call_center_df.info())
call_center_df['at'] = pd.to_numeric(call_center_df['at_str'], errors='coerce')
print(call_center_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 11 columns):
 #   Column             Non-Null Count    Dtype         
---  ------             --------------    -----         
 0   code               1000000 non-null  object        
 1   client             1000000 non-null  object        
 2   product            1000000 non-null  object        
 3   date_time          979847 non-null   datetime64[ns]
 4   attention_time     950263 non-null   float64       
 5   comment            1000000 non-null  object        
 6   country_of_origin  1000000 non-null  object        
 7   city               858525 non-null   object        
 8   dt_str             1000000 non-null  object        
 9   dt                 979847 non-null   datetime64[ns]
 10  at_str             1000000 non-null  object        
dtypes: datetime64[ns](2), float64(1), object(8)
memory usage: 83.9+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 t

In [41]:
# convert column to category using astype()
call_center_df = call_center_df.drop(columns=['dt_str', 'at_str'])
del call_center_df['dt']
del call_center_df['at']
call_center_df = call_center_df.rename(columns={'country_of_origin': 'country'})
call_center_df['country'] = call_center_df['country'].astype('category')
call_center_df['city'] = call_center_df['city'].astype('category')
print(call_center_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column          Non-Null Count    Dtype         
---  ------          --------------    -----         
 0   code            1000000 non-null  object        
 1   client          1000000 non-null  object        
 2   product         1000000 non-null  object        
 3   date_time       979847 non-null   datetime64[ns]
 4   attention_time  950263 non-null   float64       
 5   comment         1000000 non-null  object        
 6   country         1000000 non-null  category      
 7   city            858525 non-null   category      
dtypes: category(2), datetime64[ns](1), float64(1), object(4)
memory usage: 47.7+ MB
None
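
Part of the memory reduction above comes from dropping the helper columns, so it does not isolate the effect of the `category` dtype. A small sketch on a synthetic Series (the labels and sizes are assumptions, not the call center data) compares the two representations directly:

```python
import pandas as pd

# Hypothetical low-cardinality column: 3 distinct labels repeated many times
countries = pd.Series(["Brazil", "Italy", "India"] * 10_000)

# deep=True also counts the memory held by the Python string objects
as_object = countries.memory_usage(deep=True)
as_category = countries.astype("category").memory_usage(deep=True)

print(f"object: {as_object} bytes, category: {as_category} bytes")
```

The saving comes from storing each distinct label once and keeping small integer codes per row, so it shrinks as the number of distinct values grows.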


### Student Info for Merging and Joining

In [42]:
df_csv.columns = df_csv.columns.str.lower()
df_csv.head()

Unnamed: 0,name,company,position,salary
0,Alice Johnson,"Hernandez, Griffith and Nelson",Petroleum engineer,4740
1,David Jones,Gomez-Garcia,"Geologist, engineering",73329
2,Eva Brown,Blevins LLC,Microbiologist,83245
3,Frank Davis,Greene-Wilson,Museum education officer,74390
4,Jack Anderson,Butler PLC,"Scientist, research (maths)",69851


In [43]:
df_json.head()

Unnamed: 0,id,name,career,college
0,1,Alice Johnson,Computer Science,Tech University
1,2,Bob Smith,Mechanical Engineering,Engineering Institute
2,3,Carol Williams,Electrical Engineering,Tech University
3,4,David Jones,Biology,Science College
4,5,Eva Brown,Physics,Tech University


In [44]:
# merge dataframes using merge()
students_merge_df = pd.merge(df_csv, df_json, on='name')
students_merge_df.to_csv('csv-files/students_merge.csv', index=False)
students_merge_df.head()

Unnamed: 0,name,company,position,salary,id,career,college
0,Alice Johnson,"Hernandez, Griffith and Nelson",Petroleum engineer,4740,1,Computer Science,Tech University
1,David Jones,Gomez-Garcia,"Geologist, engineering",73329,4,Biology,Science College
2,Eva Brown,Blevins LLC,Microbiologist,83245,5,Physics,Tech University
3,Frank Davis,Greene-Wilson,Museum education officer,74390,6,Chemistry,Science College
4,Jack Anderson,Butler PLC,"Scientist, research (maths)",69851,10,Software Engineering,Tech University
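
`pd.merge()` defaults to an inner join, which is why Bob Smith and Carol Williams (present only in `df_json`) are missing from the result above. The following sketch, using two made-up frames, shows how `how='outer'` and `indicator=True` reveal which side each row came from:

```python
import pandas as pd

# Hypothetical frames that share only some names
left = pd.DataFrame({"name": ["Alice", "Bob", "Carol"], "salary": [100, 200, 300]})
right = pd.DataFrame({"name": ["Alice", "Carol", "Dan"], "college": ["A", "C", "D"]})

# Inner join (the default): only names present in both frames
inner = pd.merge(left, right, on="name")

# Outer join: every name is kept; the _merge column reports the source of each row
outer = pd.merge(left, right, on="name", how="outer", indicator=True)

print(inner)
print(outer)
```

Checking `_merge` before switching back to an inner join is a cheap way to see how many rows would be silently dropped.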


In [45]:
# concatenate dataframes using concat()
students_concat_df = pd.concat([df_csv, df_json], axis=0)
students_concat_df.to_csv('csv-files/students_concat.csv', index=False)
students_concat_df.head()

Unnamed: 0,name,company,position,salary,id,career,college
0,Alice Johnson,"Hernandez, Griffith and Nelson",Petroleum engineer,4740.0,,,
1,David Jones,Gomez-Garcia,"Geologist, engineering",73329.0,,,
2,Eva Brown,Blevins LLC,Microbiologist,83245.0,,,
3,Frank Davis,Greene-Wilson,Museum education officer,74390.0,,,
4,Jack Anderson,Butler PLC,"Scientist, research (maths)",69851.0,,,
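
With `axis=0`, `concat()` stacks rows and fills columns that exist in only one frame with `NaN`, which is also why `salary` is shown as a float above. A minimal sketch with two one-row frames (hypothetical data) makes both effects visible:

```python
import pandas as pd

# Hypothetical frames with only the "name" column in common
a = pd.DataFrame({"name": ["Alice"], "salary": [100]})
b = pd.DataFrame({"name": ["Bob"], "college": ["Tech University"]})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each frame's index
stacked = pd.concat([a, b], axis=0, ignore_index=True)

print(stacked)
print(stacked.dtypes)
```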


In [46]:
temp_concat_df = pd.concat([df_csv, df_csv.copy()], axis=0).reset_index(drop=True)
print(temp_concat_df.info())
print("*"*50)
temp_concat_df = temp_concat_df.drop_duplicates()
print(temp_concat_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   name      60 non-null     object
 1   company   60 non-null     object
 2   position  60 non-null     object
 3   salary    60 non-null     int64 
dtypes: int64(1), object(3)
memory usage: 2.0+ KB
None
**************************************************
<class 'pandas.core.frame.DataFrame'>
Index: 30 entries, 0 to 29
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   name      30 non-null     object
 1   company   30 non-null     object
 2   position  30 non-null     object
 3   salary    30 non-null     int64 
dtypes: int64(1), object(3)
memory usage: 1.2+ KB
None
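
By default `drop_duplicates()` only removes rows that are identical in every column. As a sketch under made-up data, the `subset` and `keep` parameters let you deduplicate on a key and choose which occurrence survives:

```python
import pandas as pd

# Hypothetical frame: Alice appears twice with different salaries
df = pd.DataFrame({
    "name": ["Alice", "Alice", "Bob"],
    "salary": [100, 120, 200],
})

# No two rows are fully identical, so nothing is removed
exact = df.drop_duplicates()

# Deduplicate on "name" only and keep the last occurrence of each name
by_name = df.drop_duplicates(subset="name", keep="last")

print(len(exact))
print(by_name)
```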


In [47]:
df_csv = df_csv.set_index('name')
df_json = df_json.set_index('name')

In [48]:
# join dataframes using join()
students_join_df = df_csv.join(df_json, how='inner')
students_join_df.to_csv('csv-files/students_join.csv')
students_join_df.head()

Unnamed: 0_level_0,company,position,salary,id,career,college
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Alice Johnson,"Hernandez, Griffith and Nelson",Petroleum engineer,4740,1,Computer Science,Tech University
David Jones,Gomez-Garcia,"Geologist, engineering",73329,4,Biology,Science College
Eva Brown,Blevins LLC,Microbiologist,83245,5,Physics,Tech University
Frank Davis,Greene-Wilson,Museum education officer,74390,6,Chemistry,Science College
Jack Anderson,Butler PLC,"Scientist, research (maths)",69851,10,Software Engineering,Tech University
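
`join()` aligns on the index, which is why both frames were re-indexed by `name` first. A small sketch with hypothetical frames suggests that it behaves like `merge()` with `left_index=True` and `right_index=True`:

```python
import pandas as pd

# Hypothetical frames indexed by name
left = pd.DataFrame(
    {"salary": [100, 200]},
    index=pd.Index(["Alice", "Bob"], name="name"),
)
right = pd.DataFrame(
    {"college": ["A", "C"]},
    index=pd.Index(["Alice", "Carol"], name="name"),
)

# Index-based inner join
joined = left.join(right, how="inner")

# The same result expressed with merge()
merged = pd.merge(left, right, left_index=True, right_index=True, how="inner")

print(joined)
print(merged)
```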


### Back to the Call Center Data

In [49]:
# group dataframes using groupby()
# group by country; observed=False keeps the current default for categorical keys and silences the FutureWarning
for country, country_df in call_center_df.groupby('country', observed=False):
    print(country)
    print(country_df.head())


Australia
           code           client           product           date_time  \
7   iQI-5637711  Richard Herrera  Wireless Charger                 NaT   
34  duD-2082213    Jamie Estrada   Streaming Stick 2022-01-29 18:05:49   
40  aLz-4866545     James Nelson    Digital Camera 2024-05-20 03:33:30   
58  aNx-4547164    Michael Davis           Monitor 2021-12-09 20:56:03   
69  wiz-4312029  Jennifer Murray        3D Printer 2024-05-29 17:10:56   

    attention_time                               comment    country      city  
7        90.696875  Charging speed is slower than wired.  Australia    Sydney  
34       46.029668         Turns any TV into a smart TV.  Australia  Canberra  
40      146.931571   Compact and easy to carry on trips.  Australia  Canberra  
58       77.915983       Stand could be more adjustable.  Australia  Adelaide  
69       11.571618            Material costs can add up.  Australia     Perth  
Brazil
           code               client              product 

In [50]:
# group by country and city
for (country, city), country_city_df in call_center_df.groupby(['country', 'city'], observed=False):
    print(country, city)
    print(country_city_df.head())


Australia Adelaide
            code          client            product           date_time  \
58   aNx-4547164   Michael Davis            Monitor 2021-12-09 20:56:03   
82   AxV-5493389      Amy Dillon      Graphics Card 2024-02-09 10:56:23   
251  guI-8126215      Corey Gray             Tablet 2023-11-02 11:39:20   
285  EzZ-7506837  Dylan Price MD   Wireless Charger 2021-12-28 09:58:18   
481  TuI-4167846  Brittany Hardy  Bluetooth Speaker 2023-04-02 11:27:15   

     attention_time                                    comment    country  \
58        77.915983            Stand could be more adjustable.  Australia   
82       137.804867  Pricey, but worth it for the performance.  Australia   
251      136.081584            Light and easy to carry around.  Australia   
285       57.067966       Charging speed is slower than wired.  Australia   
481       28.677439          Great sound quality for its size.  Australia   

         city  
58   Adelaide  
82   Adelaide  
251  Adelaide  
285

In [51]:
# group and aggregate dataframes using groupby() and agg()
# group by country and get the mean, std, min, max of attention_time
# string names replace np.min / np.nanmax, which pandas reports as deprecated inside agg(); 'max' skips NaN by default
grouped_country_df = call_center_df.groupby('country', observed=False).agg({'attention_time': ['mean', 'std', 'min', 'max']})
grouped_country_df

Unnamed: 0_level_0,attention_time,attention_time,attention_time,attention_time
Unnamed: 0_level_1,mean,std,min,max
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Australia,77.564133,41.825168,5.002475,149.999597
Brazil,77.756712,41.852127,5.000688,149.999462
Canada,77.474514,41.881525,5.00191,149.996124
China,77.535011,41.908777,5.002371,149.999675
France,77.659527,41.936573,5.001077,149.998944
Germany,77.304183,41.765321,5.00039,149.997648
India,77.575955,41.794698,5.00528,149.995586
Italy,77.408117,41.829431,5.003875,149.996914
Japan,77.500571,41.858569,5.000739,149.999943
Mexico,77.61292,41.829822,5.000968,149.999955
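
Passing a dict of lists to `agg()` produces the two-level column header seen above. One alternative is named aggregation, sketched here on a toy frame (column names such as `mean_time` are illustrative choices), which yields flat column names:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
toy = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Italy", "Italy"],
    "attention_time": [10.0, 30.0, 50.0, np.nan],
})

# Named aggregation: new_column=(source_column, function)
summary = toy.groupby("country").agg(
    mean_time=("attention_time", "mean"),
    max_time=("attention_time", "max"),
    n_calls=("attention_time", "count"),  # count ignores NaN
)

print(summary)
```

Flat names are easier to reference later and survive a round trip through CSV without the `Unnamed: ..._level_...` placeholders.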


In [52]:
# group by country and city and get the mean, std, min, max of attention_time
grouped_country_city_df = call_center_df.groupby(['country', 'city'], observed=False).agg({'attention_time': ['mean', 'std', 'min', 'max']})
# with observed=False, country/city pairs that never occur produce all-NaN rows; drop them
grouped_country_city_df = grouped_country_city_df.dropna()
grouped_country_city_df.to_csv('csv-files/grouped_country_city.csv')
grouped_country_city_df

Unnamed: 0_level_0,Unnamed: 1_level_0,attention_time,attention_time,attention_time,attention_time
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,std,min,max
country,city,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Australia,Adelaide,77.709053,41.633533,5.005949,149.983077
Australia,Brisbane,77.662205,42.006778,5.012611,149.994752
Australia,Canberra,76.959055,41.991513,5.008690,149.996272
Australia,Melbourne,77.809943,41.732739,5.026245,149.991688
Australia,Perth,77.996317,41.843446,5.023507,149.940369
...,...,...,...,...,...
USA,Los Angeles,77.425662,41.674721,5.002356,149.980056
USA,Miami,77.615529,41.591648,5.006401,149.984026
USA,New York,77.599644,42.025886,5.000887,149.976766
USA,San Francisco,77.381475,41.738286,5.019476,149.982310


In [53]:
grouped_country_city_df = grouped_country_city_df.reset_index()
grouped_country_city_df.head()

Unnamed: 0_level_0,country,city,attention_time,attention_time,attention_time,attention_time
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,mean,std,min,max
0,Australia,Adelaide,77.709053,41.633533,5.005949,149.983077
1,Australia,Brisbane,77.662205,42.006778,5.012611,149.994752
2,Australia,Canberra,76.959055,41.991513,5.00869,149.996272
3,Australia,Melbourne,77.809943,41.732739,5.026245,149.991688
4,Australia,Perth,77.996317,41.843446,5.023507,149.940369


In [54]:
# group and transform dataframes using groupby() and transform()
# transform() returns one value per original row, so the result can be assigned as a new column
call_center_df['attention_time_mean'] = call_center_df.groupby('country', observed=False)['attention_time'].transform('mean')
call_center_df['attention_time_std'] = call_center_df.groupby('country', observed=False)['attention_time'].transform('std')
call_center_df = call_center_df.sort_values(by='country')
print(call_center_df.shape)
call_center_df.head()


(1000000, 10)


Unnamed: 0,code,client,product,date_time,attention_time,comment,country,city,attention_time_mean,attention_time_std
999999,XRh-5634243,Christopher Harmon,Streaming Stick,2021-05-25 09:54:38,11.09984,Turns any TV into a smart TV.,Australia,Canberra,77.564133,41.825168
70627,BYO-8699106,Richard Watkins,Noise Cancelling Headphones,2020-09-30 05:59:40,63.467884,Sound quality is top-notch.,Australia,Brisbane,77.564133,41.825168
504922,uFw-2994564,Robert Juarez,Smartphone,2020-02-26 15:57:40,5.258973,Wish it had more storage options.,Australia,Adelaide,77.564133,41.825168
504923,zyz-1962524,Billy Bradley,Smart Thermostat,2023-10-23 04:09:19,60.87798,Saves on heating and cooling costs.,Australia,Perth,77.564133,41.825168
504934,hkc-4959632,Robert Walker,Smart Watch,2020-02-19 07:38:18,67.007363,Tracks fitness activity accurately.,Australia,Melbourne,77.564133,41.825168


In [55]:
print(call_center_df['attention_time_mean'].unique())

[77.5641334  77.75671188 77.47451364 77.53501086 77.65952717 77.30418293
 77.57595523 77.40811726 77.50057096 77.61291962 77.58733855 77.53423334
 77.57177674 77.5759291  77.3529239 ]
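
The 15 distinct means printed above are the per-country values broadcast back to every row. A short sketch on a toy frame (names and numbers are made up) contrasts this with a plain aggregation:

```python
import pandas as pd

# Hypothetical data: two countries, two calls each
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "t": [10.0, 30.0, 50.0, 70.0],
})
grouped = toy.groupby("country")["t"]

# Aggregation: one row per group
per_group = grouped.mean()

# Transformation: one value per original row, aligned with toy's index
toy["t_mean"] = grouped.transform("mean")
toy["t_centered"] = toy["t"] - toy["t_mean"]

print(per_group)
print(toy)
```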


In [56]:
# group and filter dataframes using groupby() and filter()
# keep only the rows of countries whose mean attention_time exceeds 77.6
call_center_df_filtered = call_center_df.groupby('country', observed=False).filter(lambda x: x['attention_time'].mean() > 77.6)
print(call_center_df_filtered.shape)
print(call_center_df_filtered['attention_time_mean'].unique())
call_center_df_filtered.head()


(199572, 10)
[77.75671188 77.65952717 77.61291962]


Unnamed: 0,code,client,product,date_time,attention_time,comment,country,city,attention_time_mean,attention_time_std
324875,MEF-3764563,Mikayla Peterson,Graphics Card,2023-12-01 12:38:19,57.773991,Runs cool and quiet under load.,Brazil,Salvador,77.756712,41.852127
357017,dWa-4153507,Tara Lowe,Smartphone,2020-10-31 00:27:43,36.069222,Wish it had more storage options.,Brazil,Rio de Janeiro,77.756712,41.852127
911099,bmB-3623946,Traci Snow,Desktop Computer,2023-03-28 18:56:43,72.48466,Takes up a lot of space.,Brazil,Salvador,77.756712,41.852127
357004,yzW-1913985,Kimberly Sutton,Bluetooth Speaker,2021-02-09 17:06:23,85.606281,Water-resistant feature is a plus.,Brazil,São Paulo,77.756712,41.852127
333155,Srj-1131939,Robert Lewis,Tablet,2020-05-19 14:11:58,44.727984,Great for reading and streaming videos.,Brazil,Belo Horizonte,77.756712,41.852127
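
`filter()` calls the Python function once per group, which can be slow with many groups. As a hedged alternative on toy data (the threshold of 40 is arbitrary), the same selection can be written as a boolean mask built with `transform()`:

```python
import pandas as pd

# Hypothetical data: two countries, two calls each
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "t": [10.0, 30.0, 50.0, 70.0],
})

# Group-wise filter with a Python callback
kept = toy.groupby("country").filter(lambda g: g["t"].mean() > 40)

# Same rows selected with a vectorized mask
mask = toy.groupby("country")["t"].transform("mean") > 40
kept_mask = toy[mask]

print(kept)
print(kept_mask)
```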


In [63]:
# pivot dataframes using pivot_table()
def create_category(x):
    if x > 120:
        return 'Too Bad'
    elif x > 60:
        return 'Bad'
    elif x > 20:
        return 'Medium'
    else:
        return 'Acceptable'

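# note: NaN fails every comparison in create_category, so missing attention_time values are labelled 'Acceptable'
call_center_df['attention_category'] = call_center_df['attention_time'].apply(create_category)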
call_center_df.to_csv('csv-files/call_center_data.csv', index=False)
call_center_df.pivot_table(values='attention_time', index='country', columns='attention_category', aggfunc=['mean', 'min', 'max'], observed=False).head()

Unnamed: 0_level_0,mean,mean,mean,mean,min,min,min,min,max,max,max,max
attention_category,Acceptable,Bad,Medium,Too Bad,Acceptable,Bad,Medium,Too Bad,Acceptable,Bad,Medium,Too Bad
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
Australia,12.522477,89.954171,40.200469,134.997447,5.002475,60.003472,20.001165,120.00118,19.998346,119.997908,59.999344,149.999597
Brazil,12.522218,89.99797,39.992445,134.959573,5.000688,60.000151,20.002189,120.003021,19.999621,119.998658,59.996379,149.999462
Canada,12.595325,90.104672,40.030873,135.007696,5.00191,60.001768,20.000494,120.003948,19.999028,119.999962,59.996309,149.996124
China,12.599466,89.988982,40.081815,135.117533,5.002371,60.001863,20.000859,120.001723,19.999311,119.997419,59.997002,149.999675
France,12.562566,90.026974,39.90919,134.905416,5.001077,60.000012,20.007444,120.00543,19.99904,119.998133,59.994651,149.998944
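
The `create_category` function is applied one value at a time. A possible vectorized alternative is `pd.cut()`, sketched below with the same thresholds on made-up times; unlike the function above, it leaves missing values as `NaN` instead of labelling them:

```python
import numpy as np
import pandas as pd

# Hypothetical attention times, including boundary values and a missing one
times = pd.Series([5.0, 20.0, 45.0, 60.0, 119.9, 150.0, np.nan])

# Right-closed bins: (-inf, 20], (20, 60], (60, 120], (120, inf)
categories = pd.cut(
    times,
    bins=[-np.inf, 20, 60, 120, np.inf],
    labels=["Acceptable", "Medium", "Bad", "Too Bad"],
)

print(categories)
```

The result is an ordered categorical, so the labels sort by severity rather than alphabetically.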


## Advanced Transformations

In [58]:
# making transformations using apply()
def get_region(country):
    if country in ['Brazil']:
        return 'South America'
    elif country in ['USA', 'Canada', 'Mexico']:
        return 'North America'
    elif country in ['Spain', 'France', 'UK', 'Germany', 'Italy']:
        return 'Europe'
    elif country in ['China', 'South Korea', 'India', 'Japan']:
        return 'Asia'
    elif country in ['Russia']:
        return 'Asia/Europe'
    elif country in ['Australia']:
        return 'Oceania'
    else:
        return 'NA'

call_center_df['region'] = call_center_df['country'].apply(get_region)
call_center_df.head()

Unnamed: 0,code,client,product,date_time,attention_time,comment,country,city,attention_time_mean,attention_time_std,attention_category,region
999999,XRh-5634243,Christopher Harmon,Streaming Stick,2021-05-25 09:54:38,11.09984,Turns any TV into a smart TV.,Australia,Canberra,77.564133,41.825168,Acceptable,Oceania
70627,BYO-8699106,Richard Watkins,Noise Cancelling Headphones,2020-09-30 05:59:40,63.467884,Sound quality is top-notch.,Australia,Brisbane,77.564133,41.825168,Bad,Oceania
504922,uFw-2994564,Robert Juarez,Smartphone,2020-02-26 15:57:40,5.258973,Wish it had more storage options.,Australia,Adelaide,77.564133,41.825168,Acceptable,Oceania
504923,zyz-1962524,Billy Bradley,Smart Thermostat,2023-10-23 04:09:19,60.87798,Saves on heating and cooling costs.,Australia,Perth,77.564133,41.825168,Bad,Oceania
504934,hkc-4959632,Robert Walker,Smart Watch,2020-02-19 07:38:18,67.007363,Tracks fitness activity accurately.,Australia,Melbourne,77.564133,41.825168,Bad,Oceania


In [59]:
# making transformations by chaining methods
# requires the attention_time_mean / attention_time_std columns and get_region() defined above

transformed_df = (
    call_center_df
    .assign(
        attention_time=lambda x: x['attention_time'].fillna(x['attention_time'].mean()),
        # fill before converting to text: astype(str) would turn NaN into the string 'nan'
        city=lambda x: x['city'].astype(object).fillna('non-registered'),
        # capitalize() upper-cases only the first letter of the whole string
        client=lambda x: x['client'].str.capitalize(),
        # flag calls more than one standard deviation away from their country's mean
        outlier=lambda x: (np.abs(x['attention_time_mean'] - x['attention_time']) >
                           x['attention_time_std']).astype(int)
    )
    .dropna()
    .assign(region=lambda x: x['country'].apply(get_region).str.upper())
    .set_index('code')
    .round({'attention_time': 3})
    .rename(columns={'client': 'client_name'})
)

transformed_df.head()

Unnamed: 0_level_0,client_name,product,date_time,attention_time,comment,country,city,attention_time_mean,attention_time_std,attention_category,region,outlier
code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
XRh-5634243,Christopher harmon,Streaming Stick,2021-05-25 09:54:38,11.1,Turns any TV into a smart TV.,Australia,Canberra,77.564133,41.825168,Acceptable,OCEANIA,1
BYO-8699106,Richard watkins,Noise Cancelling Headphones,2020-09-30 05:59:40,63.468,Sound quality is top-notch.,Australia,Brisbane,77.564133,41.825168,Bad,OCEANIA,0
uFw-2994564,Robert juarez,Smartphone,2020-02-26 15:57:40,5.259,Wish it had more storage options.,Australia,Adelaide,77.564133,41.825168,Acceptable,OCEANIA,1
zyz-1962524,Billy bradley,Smart Thermostat,2023-10-23 04:09:19,60.878,Saves on heating and cooling costs.,Australia,Perth,77.564133,41.825168,Bad,OCEANIA,0
hkc-4959632,Robert walker,Smart Watch,2020-02-19 07:38:18,67.007,Tracks fitness activity accurately.,Australia,Melbourne,77.564133,41.825168,Bad,OCEANIA,0


## Statistical Testing

In [60]:
# making a t-test with pandas and scipy
# a t-test compares the means of two independent groups
from scipy.stats import ttest_ind

# illustrative choice of groups: attention_time for two countries, missing values dropped
group1 = call_center_df.loc[call_center_df['country'] == 'Brazil', 'attention_time'].dropna()
group2 = call_center_df.loc[call_center_df['country'] == 'Germany', 'attention_time'].dropna()

t_stat, p_value = ttest_ind(group1, group2)
print(f"T-statistic: {t_stat}, P-value: {p_value}")

In [None]:
# making an ANOVA test with pandas and scipy
# a one-way ANOVA compares the means of three or more groups
from scipy.stats import f_oneway

# illustrative choice of groups: attention_time for three countries, missing values dropped
group1 = call_center_df.loc[call_center_df['country'] == 'Brazil', 'attention_time'].dropna()
group2 = call_center_df.loc[call_center_df['country'] == 'Germany', 'attention_time'].dropna()
group3 = call_center_df.loc[call_center_df['country'] == 'Japan', 'attention_time'].dropna()

f_stat, p_value = f_oneway(group1, group2, group3)
print(f"F-statistic: {f_stat}, P-value: {p_value}")

In [None]:
# making a chi-square test with pandas and scipy
# the chi-square test of independence checks whether two categorical variables are associated
from scipy.stats import chi2_contingency

# illustrative choice of variables: country vs. the attention_category column created earlier
contingency_table = pd.crosstab(call_center_df['country'], call_center_df['attention_category'])
chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f"Chi2: {chi2}, P-value: {p}, Degrees of freedom: {dof}")


In [None]:
# computing correlations with pandas
# correlation measures the strength of the linear relationship between two numeric variables
# numeric_only=True skips the text, datetime, and category columns, which corr() cannot handle
call_center_df.corr(numeric_only=True)

In [None]:
# p-hacking example


In [None]:
# p-hacking example with multiple testing
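
This cell is empty in the notebook. As one possible illustration, and not the author's own example, the sketch below runs many t-tests on pure noise, where no real difference exists, and counts how many still fall below 0.05. The seed, sample size, and number of tests are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)  # arbitrary fixed seed for reproducibility
n_tests = 100                   # hypothetical number of comparisons
alpha = 0.05

# Both samples come from the same distribution, so every "significant" result is a false positive
p_values = np.array([
    ttest_ind(rng.normal(size=50), rng.normal(size=50)).pvalue
    for _ in range(n_tests)
])

false_positives = int((p_values < alpha).sum())
print(f"{false_positives} of {n_tests} tests have p < {alpha} despite no real effect")
```

With a 5% threshold, roughly 5 false positives per 100 independent tests are expected on average, which is why reporting only the significant ones is misleading.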


In [None]:
# p-value example


In [None]:
# p-value correction with Bonferroni
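
This cell is also empty in the notebook. A minimal sketch of the Bonferroni correction, using made-up p-values, multiplies each p-value by the number of tests (capped at 1) before comparing it with the significance level:

```python
import numpy as np

# Hypothetical p-values from four separate tests
p_values = np.array([0.001, 0.012, 0.03, 0.2])
alpha = 0.05
n_tests = len(p_values)

# Bonferroni: scale each p-value by the number of tests, capped at 1.0
adjusted = np.minimum(p_values * n_tests, 1.0)

uncorrected_hits = int((p_values < alpha).sum())
reject = adjusted < alpha

print("adjusted p-values:", adjusted)
print("significant before correction:", uncorrected_hits)
print("significant after correction:", int(reject.sum()))
```

Equivalently, the raw p-values can be compared against `alpha / n_tests`. The correction is conservative, so it trades some power for tighter control of false positives.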
