# Introduction

Libraries are collections of modules that provide pre-written code for specific tasks.
They help developers avoid "reinventing the wheel" by offering reusable tools for things like data manipulation, scientific computing, machine learning, and more.
Libraries contain functions, classes, and objects that can be imported and used directly, saving time and effort when building applications.

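**To install and import a library:**

```
# Install the library from inside a notebook cell
%pip install pandas

# Import the whole library under an alias
import pandas as pd

# Or import a single name from the library
from pandas import DataFrame
```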

**To check the installed version or install a specific version:**
```
pd.__version__

%pip install pandas==2.2.2
```
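When a script needs the version without importing the package, the standard library's `importlib.metadata` can report it. The following is a minimal sketch; the helper name `installed_version` is hypothetical and not part of any library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string if pandas is installed, otherwise None.
print(installed_version("pandas"))
```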

# NumPy

NumPy (Numerical Python) is the core library for scientific computing in Python. It offers support for multi-dimensional arrays and a vast collection of mathematical functions to operate on these arrays quickly and efficiently.

## Introduction

In [1]:
import numpy as np

In [2]:
# Arrays

values = [1, 2, 3, 4, 5]  # avoid the name `list`, which would shadow the built-in type
npArray = np.array(values)
print(npArray)

matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
npMatrix = np.array(matrix)
print(npMatrix)

[1 2 3 4 5]
[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [3]:
print(sum(values))
# print(sum(matrix)) raises TypeError: the built-in sum cannot add nested lists to 0
np.sum(npMatrix)

15


np.int64(45)
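`np.sum` can also sum along a single axis instead of the whole array. A short sketch, reusing the same 3x3 matrix (the snake_case variable names are only for this example):

```python
import numpy as np

np_matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

total = np.sum(np_matrix)                # sum of every element
column_sums = np.sum(np_matrix, axis=0)  # collapse the rows: one sum per column
row_sums = np.sum(np_matrix, axis=1)     # collapse the columns: one sum per row

print(total)                 # 45
print(column_sums.tolist())  # [12, 15, 18]
print(row_sums.tolist())     # [6, 15, 24]
```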

## Creating

In [4]:
# Creating specific arrays

# Generates values from start to just before stop, spaced by a fixed step size.
# Use when you know the spacing between points.
print(np.arange(0, 10, 2)) # arange also accepts float values
print(range(0, 10, 2))     # the built-in range is lazy and only accepts integers

print('\nZeros')
print(np.zeros(3))
print(np.zeros([3, 2]))

print('\nOnes')
print(np.ones(3))

# Generates num evenly spaced values between start and stop (inclusive by default).
# Use when you want a specific number of points between two values.
print('\nLinear Space')
print(np.linspace(0, 10, 3))

print('\nEye')
print(np.eye(3))

print('\nRandom')
print(np.random.random(3)*10)
print(np.random.random([3, 2]))
print(np.random.randint(0, 10, 3))

[0 2 4 6 8]
range(0, 10, 2)

Zeros
[0. 0. 0.]
[[0. 0.]
 [0. 0.]
 [0. 0.]]

Ones
[1. 1. 1.]

Linear Space
[ 0.  5. 10.]

Eye
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

Random
[1.8815426  5.89951283 4.80476977]
[[0.41919732 0.12889237]
 [0.71103519 0.24323015]
 [0.33743475 0.86106683]]
[7 9 4]
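The difference between `arange` and `linspace` is easiest to see with float values, and random output becomes repeatable with a seeded generator. This is a small sketch; the seed value 42 is an arbitrary choice for the example.

```python
import numpy as np

# arange: fixed step size, the stop value is excluded.
steps = np.arange(0, 1, 0.25)

# linspace: fixed number of points, the stop value is included by default.
points = np.linspace(0, 1, 5)

# Two generators created with the same seed produce the same values.
sample_a = np.random.default_rng(seed=42).integers(0, 10, size=3)
sample_b = np.random.default_rng(seed=42).integers(0, 10, size=3)

print(steps.tolist())   # [0.0, 0.25, 0.5, 0.75]
print(points.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(np.array_equal(sample_a, sample_b))  # True
```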


In [5]:
# Reshape
array = np.zeros(4)
array.shape

(4,)

In [6]:
array.reshape([2,2])

array([[0., 0.],
       [0., 0.]])
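Note that `reshape` returns a new array object and leaves the original's shape unchanged. Passing `-1` for one dimension lets NumPy infer it from the total size. A brief sketch:

```python
import numpy as np

array = np.arange(6)

# reshape does not modify `array`; it returns an array with the new shape.
reshaped = array.reshape(2, 3)

# -1 asks NumPy to infer that dimension: 6 elements / 2 columns = 3 rows.
inferred = array.reshape(-1, 2)

print(array.shape)     # (6,)
print(reshaped.shape)  # (2, 3)
print(inferred.shape)  # (3, 2)
```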

## Operations

In [7]:
print(np.max(npMatrix))
print(np.min(npMatrix))
print(np.mean(npMatrix))
print(np.std(npMatrix))
print(np.median(npMatrix))

9
1
5.0
2.581988897471611
5.0


In [8]:
arr1 = np.arange(0, 40, 10)
arr2 = np.arange(0, 20, 5)
print('Arr1:')
print(arr1)
print('Arr2:')
print(arr2)

# Sum
print('Sum')
print(arr1 + arr2)
print(np.add(arr1, arr2))
print('\n')

# Subtraction
print('Subtraction')
print(arr1 - arr2)
print(np.subtract(arr1, arr2))
print('\n')

# Multiplication
print('Multiplication')
print(arr1 * arr2)
print(np.multiply(arr1, arr2))
print('\n')

# Division
# The first element is 0 / 0, which yields nan and triggers a RuntimeWarning.
print('Division')
print(arr1 / arr2)
# print(np.divide(arr1, arr2))

Arr1:
[ 0 10 20 30]
Arr2:
[ 0  5 10 15]
Sum
[ 0 15 30 45]
[ 0 15 30 45]


Subtraction
[ 0  5 10 15]
[ 0  5 10 15]


Multiplication
[  0  50 200 450]
[  0  50 200 450]


Division
[nan  2.  2.  2.]
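One possible way to avoid the `nan` from the 0 / 0 division is to divide only where the denominator is non-zero, using the `where` and `out` arguments of `np.divide`. This is a sketch under the assumption that 0 is an acceptable fallback value; another fallback may suit a given dataset better.

```python
import numpy as np

arr1 = np.arange(0, 40, 10)
arr2 = np.arange(0, 20, 5)

# Start from an array of zeros and divide only where the denominator is non-zero.
# Positions where the condition is False keep the value already stored in `out`.
fallback = np.zeros(arr1.shape, dtype=float)
safe = np.divide(arr1, arr2, out=fallback, where=arr2 != 0)

print(safe.tolist())  # [0.0, 2.0, 2.0, 2.0]
```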


## Indexing and Slicing

In [9]:
# Arrays
arr = np.arange(0, 20, 2)
print(arr)
print(arr[3])
print(arr[1:5])
print(arr[:5])
print(arr[7:])
print(arr[[1,7]])

[ 0  2  4  6  8 10 12 14 16 18]
6
[2 4 6 8]
[0 2 4 6 8]
[14 16 18]
[ 2 14]


In [10]:
arr[9] = 100
print(arr)

[  0   2   4   6   8  10  12  14  16 100]


In [11]:
# Matrix
arr2d = np.arange(20).reshape([5,4])
arr2d

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

In [12]:
arr2d[1:3, 2:] # rows, columns

array([[ 6,  7],
       [10, 11]])

In [13]:
boolArr = arr2d > 10
boolArr

array([[False, False, False, False],
       [False, False, False, False],
       [False, False, False,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]])

In [14]:
arr2d[boolArr]

array([11, 12, 13, 14, 15, 16, 17, 18, 19])

In [15]:
boolArr = np.logical_and(arr2d > 5, arr2d < 15)
arr2d[boolArr]

array([ 6,  7,  8,  9, 10, 11, 12, 13, 14])
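The same mask can be written with the `&` operator, and `np.where` offers a way to replace values instead of filtering them. A short sketch of both:

```python
import numpy as np

arr2d = np.arange(20).reshape(5, 4)

# Parentheses are required because & binds tighter than the comparison operators.
mask = (arr2d > 5) & (arr2d < 15)
selected = arr2d[mask]

# np.where with three arguments keeps the shape: values above 10 become 10.
capped = np.where(arr2d > 10, 10, arr2d)

print(selected.tolist())  # [6, 7, 8, 9, 10, 11, 12, 13, 14]
print(capped.shape)       # (5, 4)
print(capped.max())       # 10
```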

# Pandas

Pandas is a library focused on data manipulation and analysis. It introduces two main data structures—DataFrames and Series—which are highly optimized and make working with structured data intuitive and efficient.

## Introduction

In [16]:
import numpy as np
import pandas as pd

In [17]:
# A Series is a one-dimensional labeled array; each column of a DataFrame is a Series

myList = [10, 20, 30]
myTuple = (10, 20, 30)
myDict = {'a': 10, 'b': 20, 'c': 30}
myArray = np.array(myList)

pd.Series(data=myList)

Unnamed: 0,0
0,10
1,20
2,30


In [18]:
pd.Series(data=myTuple)

Unnamed: 0,0
0,10
1,20
2,30


In [19]:
pd.Series(data=myDict)

Unnamed: 0,0
a,10
b,20
c,30


In [20]:
pd.Series(data=myArray)

Unnamed: 0,0
0,10
1,20
2,30


In [21]:
mySeries = pd.Series(data=myList, index=['1st', '2nd', '3rd'])
mySeries[['1st', '3rd']]

Unnamed: 0,0
1st,10
3rd,30


In [22]:
pd.Series(data=['John', 'Mary', 'Peter'])

Unnamed: 0,0
0,John
1,Mary
2,Peter


In [23]:
pd.Series(data=[sum, abs, len])

Unnamed: 0,0
0,<built-in function sum>
1,<built-in function abs>
2,<built-in function len>


## DataFrame

In [24]:
data = np.random.random([5, 5])
df = pd.DataFrame(data, index='1st 2nd 3rd 4th 5th'.split(), columns=['A', 'B', 'C', 'D', 'E'])
print(df, '\n')
print(df['A'], '\n')
print(type(df['A']), type(df))

            A         B         C         D         E
1st  0.528368  0.738398  0.018255  0.452534  0.506498
2nd  0.849502  0.988596  0.710465  0.460859  0.755039
3rd  0.169535  0.890142  0.331812  0.383594  0.775742
4th  0.755742  0.145537  0.699590  0.547301  0.240930
5th  0.194738  0.741928  0.852627  0.145033  0.411000 

1st    0.528368
2nd    0.849502
3rd    0.169535
4th    0.755742
5th    0.194738
Name: A, dtype: float64 

<class 'pandas.core.series.Series'> <class 'pandas.core.frame.DataFrame'>


In [25]:
df['new'] = [1, 2, 3, 4, 5]
df['new2'] = df['A'] + df['B']
df.new

Unnamed: 0,new
1st,1
2nd,2
3rd,3
4th,4
5th,5


### Slicing

In [26]:
# Use loc to select a row by its label

df.loc['1st']

Unnamed: 0,1st
A,0.528368
B,0.738398
C,0.018255
D,0.452534
E,0.506498
new,1.0
new2,1.266767


In [27]:
print(df.loc['1st', 'A'])
print(df.loc[['1st', '2nd'], ['A', 'B']])

0.5283683769241694
            A         B
1st  0.528368  0.738398
2nd  0.849502  0.988596


In [28]:
# Use iloc to select a row by its integer position
df.iloc[0]

Unnamed: 0,1st
A,0.528368
B,0.738398
C,0.018255
D,0.452534
E,0.506498
new,1.0
new2,1.266767


In [29]:
df.iloc[0:3, 1]

Unnamed: 0,B
1st,0.738398
2nd,0.988596
3rd,0.890142


## Other Operations

In [30]:
import numpy as np
import pandas as pd

data = {
    "id_user": [1, 2, 3, 4, 5],
    "item": ['computer', 'tv', 'computer', 'shirt', 'tv'],
    "price": [500, 200, 450, 100, 200],
    "state": ['SP', 'RJ', 'SP', 'RJ', 'RJ']
}

In [31]:
# head(): Returns the first few rows of a DataFrame.
df = pd.DataFrame(data)
df.head(2)

Unnamed: 0,id_user,item,price,state
0,1,computer,500,SP
1,2,tv,200,RJ


In [32]:
df.tail(1)

Unnamed: 0,id_user,item,price,state
4,5,tv,200,RJ


In [33]:
# unique(): Shows all the unique values in a Series.
df['item'].unique()

array(['computer', 'tv', 'shirt'], dtype=object)

In [34]:
df['item'].nunique()
# or len(df['item'].unique())

3

In [35]:
# value_counts(): Counts the frequency of each unique value in a Series.
df['item'].value_counts()

item,count
computer,2
tv,2
shirt,1


In [36]:
# apply(): On a Series, applies a function to each element; on a DataFrame, to each column (or to each row with axis=1).
def test(price):
  if price > 200:
    return 'Expensive'
  else:
    return 'Cheap'

df['Test'] = df['price'].apply(test)
df

Unnamed: 0,id_user,item,price,state,Test
0,1,computer,500,SP,Expensive
1,2,tv,200,RJ,Cheap
2,3,computer,450,SP,Expensive
3,4,shirt,100,RJ,Cheap
4,5,tv,200,RJ,Cheap


In [37]:
df['item'].apply(lambda item: 'laptop' if item == 'computer' else 'other')

Unnamed: 0,item
0,laptop
1,other
2,laptop
3,other
4,other


In [38]:
df['item'].apply(lambda item: 'electronic' if item in ['computer', 'tv'] else 'other')

Unnamed: 0,item
0,electronic
1,electronic
2,electronic
3,other
4,electronic
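For simple conditions like these, a vectorized expression is a common alternative to `apply`, since it evaluates the whole column at once. The sketch below assumes the same item and price data; the column names `price_label` and `category` are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "item": ["computer", "tv", "computer", "shirt", "tv"],
    "price": [500, 200, 450, 100, 200],
})

# np.where evaluates the condition for the whole column at once.
df["price_label"] = np.where(df["price"] > 200, "Expensive", "Cheap")

# isin replaces the `item in [...]` test used inside the lambda.
is_electronic = df["item"].isin(["computer", "tv"])
df["category"] = np.where(is_electronic, "electronic", "other")

print(df)
```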


In [40]:
# index: attribute (not a method) that holds the row labels of a DataFrame; it can also be assigned to.
print(df.index)
# list(df.index)

RangeIndex(start=0, stop=5, step=1)


In [41]:
# columns: attribute (not a method) that holds the column labels of a DataFrame; it can also be assigned to.
df.columns

Index(['id_user', 'item', 'price', 'state', 'Test'], dtype='object')

In [42]:
# sort_values(): Sorts the DataFrame by the values in one or more columns.
df.sort_values(by='price', ascending = False)

Unnamed: 0,id_user,item,price,state,Test
0,1,computer,500,SP,Expensive
2,3,computer,450,SP,Expensive
1,2,tv,200,RJ,Cheap
4,5,tv,200,RJ,Cheap
3,4,shirt,100,RJ,Cheap
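`sort_values` also accepts several columns, each with its own direction. A sketch using a reduced version of the data above:

```python
import pandas as pd

df = pd.DataFrame({
    "item": ["computer", "tv", "computer", "shirt", "tv"],
    "price": [500, 200, 450, 100, 200],
    "state": ["SP", "RJ", "SP", "RJ", "RJ"],
})

# Sort by state (A to Z), then by price (highest first) within each state.
ordered = df.sort_values(by=["state", "price"], ascending=[True, False])

# sort_values returns a new DataFrame; reset_index(drop=True) renumbers the rows.
ordered = ordered.reset_index(drop=True)
print(ordered)
```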


## Group by

In [43]:
# The most common functions to aggregate grouped data are:
# sum, mean, median, count, nunique, min, max, std and agg (for custom aggregation)

import numpy as np
import pandas as pd

data = {
    "Name": ["Ana", "Bruno", "Carla", "Daniel", "Eduardo", "Fernanda"],
    "Area": ["TI", "TI", "RH", "RH", "Vendas", "Vendas"],
    "Salary": [5000, 6000, 4000, 4200, 3000, 3500],
    "State": ["SP", "RJ", "SP", "RJ", "RJ", "SP"]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Area,Salary,State
0,Ana,TI,5000,SP
1,Bruno,TI,6000,RJ
2,Carla,RH,4000,SP
3,Daniel,RH,4200,RJ
4,Eduardo,Vendas,3000,RJ
5,Fernanda,Vendas,3500,SP


In [44]:
df.groupby('Area')['Salary'].mean()

Area,Salary
RH,4100.0
TI,5500.0
Vendas,3250.0


In [45]:
df.groupby('Area')['Salary'].agg(['min', 'max'])

Area,min,max
RH,4000,4200
TI,5000,6000
Vendas,3000,3500


In [46]:
df.groupby('State').agg({'Salary': ['mean', 'max', 'min', 'count', 'nunique'], 'Area': 'nunique'})

,Salary,Salary,Salary,Salary,Salary,Area
State,mean,max,min,count,nunique,nunique
RJ,4400.0,6000,3000,3,3,3
SP,4166.666667,5000,3500,3,3,3
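The dictionary form of `agg` produces two-level column headers. If flat, descriptive column names are preferred, named aggregation is one option. In this sketch the output names `mean_salary`, `max_salary`, and `areas` are arbitrary choices for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Area": ["TI", "TI", "RH", "RH", "Vendas", "Vendas"],
    "Salary": [5000, 6000, 4000, 4200, 3000, 3500],
    "State": ["SP", "RJ", "SP", "RJ", "RJ", "SP"],
})

# Each keyword is an output column: name=(source column, aggregation function).
summary = df.groupby("State").agg(
    mean_salary=("Salary", "mean"),
    max_salary=("Salary", "max"),
    areas=("Area", "nunique"),
)
print(summary)
```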


## Concat and Merge (Join)

In [47]:
users = {
    'user_id': [1, 2, 3, 4, 5],
    'name': ['Ana', 'Bruno', 'Carla', 'Daniel', 'Eduardo'],
    'city': ['SP', 'RJ', 'SP', 'RJ', 'RJ']
}

purchases = {
    'purchase_item': ['computer', 'computer', 'tv', 'shirt', 'tv', 'shirt', 'ring'],
    'user_id': [1, 2, 3, 5, 2, 4, 1],
    'price': [500, 450, 200, 100, 200, 150, 300]
}

newUsers = {
    'user_id': [6, 7, 8],
    'name': ['Fernanda', 'Gabriel', 'Helena']
}

dfUsers = pd.DataFrame(users)
dfPurchases = pd.DataFrame(purchases)
dfNewUsers = pd.DataFrame(newUsers)

In [48]:
allUsers = pd.concat([dfUsers, dfNewUsers])
print(allUsers)

   user_id      name city
0        1       Ana   SP
1        2     Bruno   RJ
2        3     Carla   SP
3        4    Daniel   RJ
4        5   Eduardo   RJ
0        6  Fernanda  NaN
1        7   Gabriel  NaN
2        8    Helena  NaN


In [49]:
allUsers.fillna({'city': 'unknown city'}, inplace = True)
print(allUsers)

   user_id      name          city
0        1       Ana            SP
1        2     Bruno            RJ
2        3     Carla            SP
3        4    Daniel            RJ
4        5   Eduardo            RJ
0        6  Fernanda  unknown city
1        7   Gabriel  unknown city
2        8    Helena  unknown city
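In the output above the row labels 0, 1, 2 appear twice, because `concat` keeps each DataFrame's original index. Passing `ignore_index=True` renumbers the rows. A sketch with shortened versions of the two user tables:

```python
import pandas as pd

df_users = pd.DataFrame({"user_id": [1, 2], "name": ["Ana", "Bruno"], "city": ["SP", "RJ"]})
df_new_users = pd.DataFrame({"user_id": [6, 7], "name": ["Fernanda", "Gabriel"]})

# ignore_index=True discards the original row labels and numbers the rows 0..n-1.
all_users = pd.concat([df_users, df_new_users], ignore_index=True)

# Without inplace=True, fillna returns a new DataFrame that must be assigned.
all_users = all_users.fillna({"city": "unknown city"})
print(all_users)
```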


In [50]:
"""
merge(): The most flexible method, used to combine two DataFrames based on one or more common columns (like a SQL JOIN).
You can specify the type of join: 'inner', 'outer', 'left', or 'right'.
"""

dfPurchases.merge(dfUsers, on='user_id', how='left')

Unnamed: 0,purchase_item,user_id,price,name,city
0,computer,1,500,Ana,SP
1,computer,2,450,Bruno,RJ
2,tv,3,200,Carla,SP
3,shirt,5,100,Eduardo,RJ
4,tv,2,200,Bruno,RJ
5,shirt,4,150,Daniel,RJ
6,ring,1,300,Ana,SP
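To compare join types, the sketch below uses two small hypothetical tables in which some `user_id` values exist on only one side. The `indicator=True` argument adds a `_merge` column showing where each row came from.

```python
import pandas as pd

df_users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["Ana", "Bruno", "Carla"]})
df_purchases = pd.DataFrame({"user_id": [1, 1, 4], "price": [500, 300, 150]})

# inner: only user_id values present in both tables.
inner = df_purchases.merge(df_users, on="user_id", how="inner")

# outer: every row from both tables; _merge is 'both', 'left_only' or 'right_only'.
outer = df_purchases.merge(df_users, on="user_id", how="outer", indicator=True)

print(len(inner))  # 2
print(len(outer))  # 5
print(outer)
```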


## Import and Export Files

In [51]:
import pandas as pd
df = pd.read_csv('spotify_history.csv', sep = ',')
df

Unnamed: 0,spotify_track_uri,ts,platform,ms_played,track_name,artist_name,album_name,reason_start,reason_end,shuffle,skipped
0,2J3n32GeLmMjwuAzyhcSNe,2013-07-08 02:44:34,web player,3185,"Say It, Just Say It",The Mowgli's,Waiting For The Dawn,autoplay,clickrow,False,False
1,1oHxIPqJyvAYHy0PVrDU98,2013-07-08 02:45:37,web player,61865,Drinking from the Bottle (feat. Tinie Tempah),Calvin Harris,18 Months,clickrow,clickrow,False,False
2,487OPlneJNni3NWC8SYqhW,2013-07-08 02:50:24,web player,285386,Born To Die,Lana Del Rey,Born To Die - The Paradise Edition,clickrow,unknown,False,False
3,5IyblF777jLZj1vGHG2UD3,2013-07-08 02:52:40,web player,134022,Off To The Races,Lana Del Rey,Born To Die - The Paradise Edition,trackdone,clickrow,False,False
4,0GgAAB0ZMllFhbNc3mAodO,2013-07-08 03:17:52,web player,0,Half Mast,Empire Of The Sun,Walking On A Dream,clickrow,nextbtn,False,False
...,...,...,...,...,...,...,...,...,...,...,...
149855,4Fz1WWr5o0OrlIcZxcyZtK,2024-12-15 23:06:19,android,1247,On The Way Home,John Mayer,Paradise Valley,fwdbtn,fwdbtn,True,True
149856,0qHMhBZqYb99yhX9BHcIkV,2024-12-15 23:06:21,android,1515,Magical Mystery Tour - Remastered 2009,The Beatles,Magical Mystery Tour,fwdbtn,fwdbtn,True,True
149857,0HHdujGjOZChTrl8lJWEIq,2024-12-15 23:06:22,android,1283,"Stop This Train - Live at the Nokia Theatre, L...",John Mayer,Where the Light Is: John Mayer Live In Los Ang...,fwdbtn,fwdbtn,True,True
149858,7peh6LUcdNPcMdrSH4JPsM,2024-12-15 23:06:23,android,1306,I Don't Trust Myself (With Loving You),John Mayer,Continuum,fwdbtn,fwdbtn,True,True


In [52]:
len(df)

149860

In [53]:
df.shape

(149860, 11)

In [54]:
df.count()

Unnamed: 0,0
spotify_track_uri,149860
ts,149860
platform,149860
ms_played,149860
track_name,149860
artist_name,149860
album_name,149860
reason_start,149717
reason_end,149743
shuffle,149860


In [55]:
df.isna().sum()

Unnamed: 0,0
spotify_track_uri,0
ts,0
platform,0
ms_played,0
track_name,0
artist_name,0
album_name,0
reason_start,143
reason_end,117
shuffle,0


In [56]:
df_album = df[['album_name', 'artist_name']].drop_duplicates()
df_album

Unnamed: 0,album_name,artist_name
0,Waiting For The Dawn,The Mowgli's
1,18 Months,Calvin Harris
2,Born To Die - The Paradise Edition,Lana Del Rey
4,Walking On A Dream,Empire Of The Sun
5,Impossible,James Arthur
...,...,...
149738,Hells Welles,Jesse Welles
149739,Patchwork,Jesse Welles
149743,Oo-De-Lally,Roger Miller
149744,King Of The Road,Roger Miller


In [57]:
df_album.to_csv('album.csv', sep = ",")

In [58]:
df_album.drop(columns=['artist_name'], inplace=True)
df_album

Unnamed: 0,album_name
0,Waiting For The Dawn
1,18 Months
2,Born To Die - The Paradise Edition
4,Walking On A Dream
5,Impossible
...,...
149738,Hells Welles
149739,Patchwork
149743,Oo-De-Lally
149744,King Of The Road
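By default `to_csv` also writes the row labels as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. Passing `index=False` leaves them out. The sketch below uses an in-memory buffer instead of a real file, and a two-row stand-in for the album table.

```python
import io

import pandas as pd

df_album = pd.DataFrame(
    {"album_name": ["18 Months", "Continuum"], "artist_name": ["Calvin Harris", "John Mayer"]},
    index=[1, 149858],
)

buffer = io.StringIO()
df_album.to_csv(buffer, index=False)  # leave the row labels out of the output
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored.columns.tolist())  # ['album_name', 'artist_name']
```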


## Exercises

In [59]:
# 1. Detect and handle NaN values
import numpy as np
import pandas as pd

data = {'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]}
df = pd.DataFrame(data)
df

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2
2,,,3


In [60]:
df.dropna()

Unnamed: 0,A,B,C
0,1.0,5.0,1


In [61]:
df.dropna(thresh=2) # Keep only rows that have at least 2 non-NaN values

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2


In [62]:
df.dropna(axis=1) # axis=0 drops rows (default), axis=1 drops columns

Unnamed: 0,C
0,1
1,2
2,3


In [63]:
df.isna().sum()

Unnamed: 0,0
A,1
B,2
C,0


In [64]:
df.fillna(value=0)

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,0.0,2
2,0.0,0.0,3


In [65]:
df['A'].fillna(value = "")  # note: mixing strings into a float column changes its dtype to object
# or df.fillna({'A': ""})

Unnamed: 0,A
0,1.0
1,2.0
2,
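Another common treatment is to fill each gap with the mean of its own column, which keeps the columns numeric. A minimal sketch on the same data; whether the mean is an appropriate substitute depends on the dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan], "B": [5, np.nan, np.nan], "C": [1, 2, 3]})

# df.mean() skips NaN by default and returns one value per column;
# fillna then fills each column's gaps with that column's mean.
filled = df.fillna(df.mean())
print(filled)
```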
