## DATA STRUCTURES

We will start with the following two important data structures of Pandas:

- Series and
- DataFrame

### SERIES

A Series is a one-dimensional labelled array-like object. It is capable of holding any data type, e.g. integers, floats, strings, Python objects, and so on. 

It can be seen as a data structure with two arrays: one functions as
the index, i.e. the labels, and the other contains the actual data.

In [1]:
import pandas as pd # Import Pandas under the alias pd.

In [2]:
fruits = ['apple','banana','grapes','mango','kiwi']

print(type(fruits))

# Converting a list into a pandas Series

S = pd.Series(fruits)
print(S)
print(type(S))

<class 'list'>
0     apple
1    banana
2    grapes
3     mango
4      kiwi
dtype: object
<class 'pandas.core.series.Series'>


In [3]:
fruits = dict(([1,'apple'],[2,'banana'],[3,'grapes'],[4,'mango'],[5,'kiwi']))
print(type(fruits))

S = pd.Series(fruits)
S

<class 'dict'>


1     apple
2    banana
3    grapes
4     mango
5      kiwi
dtype: object

In [14]:
fruits = ['apple','banana','grapes','mango','kiwi']
print(fruits)

# Changing the default index of a pandas Series

S = pd.Series(fruits, index=['one','two','three','four','five'])
S

['apple', 'banana', 'grapes', 'mango', 'kiwi']


one       apple
two      banana
three    grapes
four      mango
five       kiwi
dtype: object

In [6]:
# Create a Series from a dictionary

# fruits: {fruit name: sales}

fruits={'apple':200,'banana':60,'grapes':130,'mango':120,'kiwi':300}

# Making a Series

Fruits = pd.Series(fruits)
print(Fruits)

apple     200
banana     60
grapes    130
mango     120
kiwi      300
dtype: int64


In [2]:
# Creating a Series

import pandas as pd

S = pd.Series([12,-4,7,9], index=['One','Two','Three','Four'])
print(S)

One      12
Two      -4
Three     7
Four      9
dtype: int64


**We can directly access the index and the values of our Series S:**

In [3]:
# Accessing index of the series
S.index

Index(['One', 'Two', 'Three', 'Four'], dtype='object')

In [4]:
# Accessing values of a Series

print(S.values)

[12 -4  7  9]


In [7]:
# Selecting one value based on an index label
S[S.index == 'Two']

Two   -4
dtype: int64

In [9]:
# Selecting multiple values based on index labels

S[(S.index == 'Two') | (S.index == 'Four')]

Two    -4
Four    9
dtype: int64
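
The Boolean masks above work, but label-based lookup with `.loc` is usually a more direct way to reach the same values. The following is a minimal sketch that rebuilds the Series `S` from this section; the names `single` and `several` are only illustrative.

```python
import pandas as pd

S = pd.Series([12, -4, 7, 9], index=['One', 'Two', 'Three', 'Four'])

# A single label returns the scalar value itself, not a one-element Series
single = S.loc['Two']

# A list of labels returns a Series containing just those entries
several = S.loc[['Two', 'Four']]

print(single)
print(several)
```

Note the difference in result type: `S[S.index == 'Two']` gives a Series with one row, while `S.loc['Two']` gives the value itself.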

In [17]:
import numpy as np

np_array1 = np.array([23,45,67,89])
print(np_array1)
print(np_array1[0])

[23 45 67 89]
23


**If we add two Series with the same indices, we get a new Series with the same index in which the corresponding values are added:**

In [25]:
S1 = pd.Series([1,2,3,4])
S2 = pd.Series([12,-4,7,9])
print(S1+S2)

0    13
1    -2
2    10
3    13
dtype: int64
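
The addition above pairs values by index label, not by position. To illustrate what happens when the indices only partly overlap, here is a small sketch; the Series `A` and `B` and their labels are made up for this example.

```python
import pandas as pd

A = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
B = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Labels present in only one Series ('a' and 'd') produce NaN in the sum
total = A + B
print(total)

# The add() method with fill_value=0 treats a missing label as 0 instead
filled = A.add(B, fill_value=0)
print(filled)
```

Because NaN is a floating-point value, the result of `A + B` has dtype `float64` even though both inputs hold integers.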


In [17]:
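# Note: judging by the output, this cell was run while S1 held the labels
# 'One' ... 'Four' and S2 the values, not the S1 = [1,2,3,4] defined above.
S3 = zip(S1,S2)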

for i in S3:
    print(i)

('One', 12)
('Two', -4)
('Three', 7)
('Four', 9)


In [23]:
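# Note: judging by the output, S1 held the labels 'One' ... 'Four' in this run.
# axis=1 places the two Series side by side as columns of a DataFrame.
S3 = pd.concat([S1,S2],axis=1)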
print(S3)
print(type(S3))

       0   1
0    One  12
1    Two  -4
2  Three   7
3   Four   9
<class 'pandas.core.frame.DataFrame'>


### INDEXING

It's possible to access single values of a Series, or several values at once by passing a list of index labels or a slice.

In [26]:
print(S1)

0    1
1    2
2    3
3    4
dtype: int64


In [27]:
S1[[1,3]]

1    2
3    4
dtype: int64

In [30]:
S1[1::2]

1    2
3    4
dtype: int64

In [31]:
S1[1:3]

1    2
2    3
dtype: int64

In [35]:
# Create a series of 50 numbers from 501-550

S = pd.Series(list(range(501,551)))
#print(S)

# Select 501,504,505
print(S[[0,3,4]])

# Select 511,521,531,541
print(S[10::10])

0    501
3    504
4    505
dtype: int64
10    511
20    521
30    531
40    541
dtype: int64
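
With the default integer index, a list such as `S1[[1,3]]` is looked up by label, while a slice such as `S1[1:3]` is taken by position. The sketch below uses `.loc` and `.iloc` to make that choice explicit; the names `nums`, `by_position` and `by_label` are assumptions for this example only.

```python
import pandas as pd

nums = pd.Series([1, 2, 3, 4])

# .iloc slices by position; the stop position is excluded
by_position = nums.iloc[1:3]

# .loc slices by label; the stop label is included
by_label = nums.loc[1:3]

print(by_position)
print(by_label)
```

The two results differ by one element, which is a common source of off-by-one surprises when the index labels happen to be integers.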


### FILTERING WITH A BOOLEAN ARRAY

Similar to numpy arrays, we can filter Pandas Series with a Boolean array:

In [61]:
# Filtering using the greater-than comparison operator
print(list(S[S>525]))

[526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550]


In [62]:
# Filtering using the less-than comparison operator
print(list(S[S<525]))

[501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524]


In [39]:
# Filtering using equality
print(S[S==525])

24    525
dtype: int64


In [63]:
# Filtering using inequality
print(list(S[S!=525]))

[501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550]
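
The single comparisons above can also be combined. As a rough sketch, the element-wise operators `&` (and), `|` (or) and `~` (not) join Boolean Series; each comparison needs its own parentheses. The variable names here are illustrative.

```python
import pandas as pd

S = pd.Series(list(range(501, 551)))

# Both conditions must hold: values strictly between 510 and 515
between = S[(S > 510) & (S < 515)]

# Either condition may hold: values at the two ends of the range
outside = S[(S < 503) | (S > 548)]

# Negating a mask: everything except 525
not_525 = S[~(S == 525)]

print(list(between))
print(list(outside))
print(len(not_525))
```

Python's `and`, `or` and `not` keywords do not work element-wise on a Series, which is why the symbolic operators are used here.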


In [46]:
# Create 3 Series from S: first 20 in S1, last 20 in S3, the remaining 10 in S2

S1 = S[:20]
S3 = S[-20:]
print(list(S1))

[501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520]


In [47]:
print(list(S3))

[531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550]


In [48]:
S2 = S[20:30]
print(list(S2))

[521, 522, 523, 524, 525, 526, 527, 528, 529, 530]


In [49]:
# Create 2 Series from S: S1 = 501,503,505..., S2 = 502,504,506...

S1 = S[::2]
print(list(S1))

[501, 503, 505, 507, 509, 511, 513, 515, 517, 519, 521, 523, 525, 527, 529, 531, 533, 535, 537, 539, 541, 543, 545, 547, 549]


In [51]:
S2 = S[1::2]
print(list(S2))

[502, 504, 506, 508, 510, 512, 514, 516, 518, 520, 522, 524, 526, 528, 530, 532, 534, 536, 538, 540, 542, 544, 546, 548, 550]


In [56]:
# Create 2 Series from S: S1 = first 25 values, S2 = last 25 values

S1 = S[:len(S)//2]
S2 = S[len(S)//2:]
print(list(S1))

[501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525]


In [58]:
print(list(S2))

[526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550]


In [60]:
# Reversing the Series with a negative step
print(list(S[::-1]))

[550, 549, 548, 547, 546, 545, 544, 543, 542, 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 531, 530, 529, 528, 527, 526, 525, 524, 523, 522, 521, 520, 519, 518, 517, 516, 515, 514, 513, 512, 511, 510, 509, 508, 507, 506, 505, 504, 503, 502, 501]


### CREATING SERIES OBJECTS FROM DICTIONARIES

We can even use a dictionary to create a Series object. The resulting Series contains the dict's keys as the indices and the values as the values.

In [1]:
cities = {"London": 8615246,
"Berlin": 3562166,
"Madrid": 3165235,
"Rome": 2874038,
"Paris": 2273305,
"Vienna": 1805681,
"Bucharest": 1803425,
"Hamburg": 1760433,
"Budapest": 1754000,
"Warsaw": 1740119,
"Barcelona": 1602386,
"Munich": 1493900,
"Milan": 1350680}

import pandas as pd # pandas is a data analysis library

Cities = pd.Series(cities)
print(Cities)

London       8615246
Berlin       3562166
Madrid       3165235
Rome         2874038
Paris        2273305
Vienna       1805681
Bucharest    1803425
Hamburg      1760433
Budapest     1754000
Warsaw       1740119
Barcelona    1602386
Munich       1493900
Milan        1350680
dtype: int64


In [3]:
li1 = ["London","Berlin","Madrid"]
li2 = [2874038,2273305,1805681]

S = pd.Series(li2, index=li1)
S

London    2874038
Berlin    2273305
Madrid    1805681
dtype: int64
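
When a dictionary is combined with an explicit `index`, the index decides which keys are used and in what order. The sketch below assumes a shortened version of the `cities` dictionary; `"Zurich"` is a made-up label that is deliberately absent from it.

```python
import pandas as pd

cities = {"London": 8615246, "Berlin": 3562166, "Madrid": 3165235}

# Only the listed labels are kept, in the listed order
wanted = ["Madrid", "London", "Zurich"]
subset = pd.Series(cities, index=wanted)

print(subset)
```

A label without a matching key gets NaN as its value, and the dtype becomes `float64` as a result.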

### NAN - MISSING DATA


In [9]:
# Create a pandas Series with one or more missing values

import numpy as np

S = pd.Series([23,45,np.nan,67,89,np.nan]) # NaN: Not a Number (np.NaN was removed in NumPy 2.0)
print(S)

0    23.0
1    45.0
2     NaN
3    67.0
4    89.0
5     NaN
dtype: float64


### THE METHODS ISNULL() AND NOTNULL()

"NaN" stands for "not a number". In our example it marks a missing value.

In [6]:
# Checking for missing data: isnull() flags each NaN, sum() counts them

S.isnull().sum()

2

In [7]:
# isna() and isnull() are the same method under two names
S.isna().sum()

2
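
The counterpart `notnull()` is not shown in the cells above. As a minimal sketch, it returns `True` wherever a value is present, so it can serve as a filter mask; the names `mask` and `present` are illustrative.

```python
import numpy as np
import pandas as pd

S = pd.Series([23, 45, np.nan, 67, 89, np.nan])

# True for every entry that holds a real value
mask = S.notnull()

# Counting the present values, then keeping only those entries
print(mask.sum())
present = S[mask]
print(present)
```

Filtering with this mask keeps the original index labels, just as `dropna()` does.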

In [8]:
# Dropping missing values

S.dropna()

0    23.0
1    45.0
3    67.0
4    89.0
dtype: float64

In [10]:
# Printing original Series

print(S)

0    23.0
1    45.0
2     NaN
3    67.0
4    89.0
5     NaN
dtype: float64


In [None]:
# Replacing a missing value with another value
# We can replace a missing value with one of the following:
# the average (mean) value
# the median value
# the next valid value after the gap
# the previous valid value before the gap
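
The median, next-value and previous-value strategies listed above could be sketched as follows. This assumes the same Series as before and uses `ffill()` and `bfill()` for the previous and next value; the result names are illustrative.

```python
import numpy as np
import pandas as pd

S = pd.Series([23, 45, np.nan, 67, 89, np.nan])

# Median of the values that are present (NaN is skipped)
by_median = S.fillna(S.median())

# Forward fill: copy the previous valid value into each gap
by_previous = S.ffill()

# Backward fill: copy the next valid value into each gap
by_next = S.bfill()

print(by_median)
print(by_previous)
print(by_next)
```

Backward fill leaves the last entry as NaN here, because no valid value follows it. None of these calls modifies `S` itself; each returns a new Series.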

In [12]:
# replacing a missing value with a fixed number/data

S.fillna(999, inplace=True) # inplace
print(S)

0     23.0
1     45.0
2    999.0
3     67.0
4     89.0
5    999.0
dtype: float64


In [17]:
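# Replacing a missing value with the mean value
# Caution: S was already filled with 999 in the previous cell, so it has no NaN left.
# The mean below therefore includes the two 999 values, and fillna() changes nothing.

# Finding the average

avg = S.mean()
print(round(avg,0))
S.fillna(avg, inplace=True)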
print(S)

370.0
0     23.0
1     45.0
2    999.0
3     67.0
4     89.0
5    999.0
dtype: float64


In [3]:
import pandas as pd
import numpy as np

# Recreate the Series so that it contains NaN again (np.NaN was removed in NumPy 2.0)
S = pd.Series([23,45,np.nan,67,89,np.nan]) # NaN: Not a Number
print(S)

S.fillna(S.mean(),inplace=True)
print(S)

0    23.0
1    45.0
2     NaN
3    67.0
4    89.0
5     NaN
dtype: float64
0    23.0
1    45.0
2    56.0
3    67.0
4    89.0
5    56.0
dtype: float64


In [4]:
# Checking the mean by hand: mean() skips NaN, so only the 4 present values count
(23+45+67+89)/4

56.0