#  Introduction to Pandas

<h2> Outline<span class="tocSkip"></span></h2>
<hr>
<div class="toc"><ul class="toc-item"><li><span><a href="#1.-Introduction-to-Pandas" data-toc-modified-id="1.-Introduction-to-Pandas-1">1. Introduction to Pandas</a></span></li><li><span><a href="#2.-Pandas-Series" data-toc-modified-id="2.-Pandas-Series-2">2. Pandas Series</a></span></li><li><span><a href="#3.-Pandas-DataFrames" data-toc-modified-id="3.-Pandas-DataFrames-3">3. Pandas DataFrames</a></span></li><li><span><a href="#4.-Why-ndarrays-and-Series-and-DataFrames?" data-toc-modified-id="4.-Why-ndarrays-and-Series-and-DataFrames?-4">4. Why ndarrays and Series and DataFrames?</a></span></li></ul></div>

## Chapter Learning Objectives
<hr>

- Create Pandas series with `pd.Series()` and Pandas dataframe with `pd.DataFrame()`
- Be able to access values from a Series/DataFrame by indexing, slicing and boolean indexing using notation such as `df[]`, `df.loc[]`, `df.iloc[]`, `df.query()`
- Perform basic arithmetic operations between two series and anticipate the result.
- Describe how Pandas assigns dtypes to Series and what the `object` dtype is
- Read a standard .csv file from a local path or url using Pandas `pd.read_csv()`.
- Explain the relationship and differences between `np.ndarray`, `pd.Series` and `pd.DataFrame` objects in Python.

## 1. Introduction to Pandas
<hr>

Pandas is the most popular Python library for tabular data structures. You can think of Pandas as an extremely powerful version of Excel (but free and with a lot more features!)

Pandas can be installed using `conda`:

```
conda install pandas
```

We usually import pandas with the alias `pd`. You'll see these two imports at the top of most data science workflows:

In [None]:
!pip install pandas

In [None]:
import pandas as pd
import numpy as np

## 2. Pandas Series
<hr>

### What are Series?

A Series is like a NumPy array but with labels. Series are strictly 1-dimensional and can contain any data type (integers, strings, floats, objects, etc), including a mix of them. Series can be created from a scalar, a list, ndarray or dictionary using `pd.Series()` (**note the capital "S"**). Here are some example series:

![](img/chapter7/series.png)

### Creating Series

By default, series are labelled with indices starting from 0. For example:

In [None]:
pd.Series(data = [3, 3.5, "Erum"])

0       3
1     3.5
2    Erum
dtype: object

In [None]:
pd.Series(data = [3, 7.7, "Yusra",3,7], index = ["a", "b", "c","d", "e"])

a        3
b      7.7
c    Yusra
d        3
e        7
dtype: object

In [None]:
pd.Series(data = ["Musa", "Yessi", "Japhet", "Magdalena", "Kiko", "Dickson"])

0         Musa
1        Yessi
2       Japhet
3    Magdalena
4         Kiko
5      Dickson
dtype: object

# Assignment 6: Task 1
Create a Series containing the names of 10 students with a custom index.

In [None]:
pd.Series(data = [-5, 1.3, 21, 6, 3,"erum"])

0      -5
1     1.3
2      21
3       6
4       3
5    erum
dtype: object

But you can add a custom index:

In [None]:
pd.Series(data = [-5, 1.3, 21, 6, 3,"erum"],
          index = ['OmdSc1', 'OmdSc2', 'c', 'd', 'e', 'f'])

OmdSc1      -5
OmdSc2     1.3
c           21
d            6
e            3
f         erum
dtype: object

You can create a Series from a dictionary:

# Dictionary
A dictionary stores each data entry as a key-value pair (k, v), e.g.
Name: Erum, Student_ID: 20245, Email: erum@omdena.com

In [None]:
dic1 = {"Name": "Erum", "Email": "erum@omdena.com", "Marks" : 70} # (Key: value)
print(dic1)

{'Name': 'Erum', 'Email': 'erum@omdena.com', 'Marks': 70}


In [None]:
Info_std = pd.Series(data = {"Name":  "Erum" , "Student_ID": 20245, "Email": "erum@omdena.com"})
Info_std

Name                     Erum
Student_ID              20245
Email         erum@omdena.com
dtype: object

In [None]:
pd.Series(data = {'a': 10, 'b': 20, 'c': 30})

a    10
b    20
c    30
dtype: int64

# NDArray
Or from an ndarray:

In [None]:
test = np.random.randn(10) # random array
test

array([-0.11960268,  0.07914931, -0.97273154, -1.33025285,  0.17055522,
        1.00244351,  0.41048076, -1.30184026,  0.0190745 , -0.68902862])

In [None]:
pd.Series(data = test)

0   -0.119603
1    0.079149
2   -0.972732
3   -1.330253
4    0.170555
5    1.002444
6    0.410481
7   -1.301840
8    0.019075
9   -0.689029
dtype: float64

In [None]:
pd.Series(data = np.random.randn(10))

0    0.378685
1   -1.112932
2    0.191860
3    0.913784
4    3.268865
5   -0.804950
6   -1.238169
7   -0.699612
8   -0.487119
9   -1.448421
dtype: float64

Or even a scalar:

In [None]:
pd.Series(3.141)

0    3.141
dtype: float64

In [None]:
SoudData = pd.Series(data = "Sound", index = [1,2,3,4 ,5,6,7,8,9,10])
SoudData

1     Sound
2     Sound
3     Sound
4     Sound
5     Sound
6     Sound
7     Sound
8     Sound
9     Sound
10    Sound
dtype: object

In [None]:
SoudData.rename("testing")

1     Sound
2     Sound
3     Sound
4     Sound
5     Sound
6     Sound
7     Sound
8     Sound
9     Sound
10    Sound
Name: testing, dtype: object

In [None]:
pd.Series(data=3.141, index=['a', 'b', 'c', "d", "e"])

a    3.141
b    3.141
c    3.141
d    3.141
e    3.141
dtype: float64

### Series Characteristics

Series can be given a `name` attribute. I almost never use this but it might come up sometimes:

In [None]:
testing_series = pd.Series(data = np.random.randn(5), name='random_series')
testing_series

0    1.294766
1    0.073130
2   -1.347211
3   -0.527814
4   -0.892070
Name: random_series, dtype: float64

In [None]:
testing_series.name
testing_series.index

RangeIndex(start=0, stop=5, step=1)

In [None]:
testing_series.rename("new_name")

0    1.294766
1    0.073130
2   -1.347211
3   -0.527814
4   -0.892070
Name: new_name, dtype: float64

In [None]:
testing_series.index
SoudData.index

Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype='int64')

In [None]:
testing_series.name  # .rename() returned a new series, so the original name is unchanged

'random_series'

In [None]:
testing_series.rename("another_name")

0    1.294766
1    0.073130
2   -1.347211
3   -0.527814
4   -0.892070
Name: another_name, dtype: float64

You can access the index labels of your series using the `.index` attribute:

In [None]:
testing_series.index

RangeIndex(start=0, stop=5, step=1)

# Converting Series to NumPy Array

You can access the underlying data array using `.to_numpy()`.

Many operations can only be performed on numeric data, so we sometimes convert a series or dataframe to a NumPy array:

In [None]:
testing_series = testing_series

In [None]:
numpy_series = pd.Series(data = np.random.randn(5), name='random_series')
numpy_series

0   -1.188392
1    0.996157
2    2.023318
3    2.256911
4   -0.377414
Name: random_series, dtype: float64

In [None]:
numpy_series.to_numpy()

array([-1.18839155,  0.99615667,  2.02331771,  2.25691087, -0.37741354])

In [None]:
s.to_numpy()

array([-1.34258874, -0.60618284, -0.96911935,  0.82011352,  1.07822909])

In [None]:
SoudData.to_numpy()


array(['Sound', 'Sound', 'Sound', 'Sound', 'Sound', 'Sound', 'Sound',
       'Sound', 'Sound', 'Sound'], dtype=object)

In [None]:
pd.Series([[1, 2, 3], "b", 1]).to_numpy()

array([list([1, 2, 3]), 'b', 1], dtype=object)

# Revised Example of Indexing

In [None]:
marks = [10, 19,8,9,48,7,5,38,29]
print(marks) # all data


[10, 19, 8, 9, 48, 7, 5, 38, 29]


In [None]:

print(marks[2]) # any index (starting from 0)


8


In [None]:
# slice
print(marks[1:6]) # the stop index 6 is excluded, so the last element returned is at index 5

[19, 8, 9, 48, 7]


In [None]:
test = [5,8,9,30,20, 4.6, 9, 5.9,35]
print(test) # all values
print(test[1:5]) # slice
print(test[2]) # one value

[5, 8, 9, 30, 20, 4.6, 9, 5.9, 35]
[8, 9, 30, 20]
9


### Indexing and Slicing Series

Series are very much like ndarrays (in fact, series can be passed to most NumPy functions!). They can be indexed using square brackets `[ ]` and sliced using colon `:` notation:

In [None]:
s = pd.Series(data = range(5), index = ['A', 'B', 'C', 'D', 'E'])
s

A    0
B    1
C    2
D    3
E    4
dtype: int64

In [None]:
S = pd.Series(data = [0,1,7,8,9], index = ["a", "b", "c", "d", "e"])

In [None]:
S["d"]

8

In [None]:
S["e"]

9

In [None]:
S["b"]

1

In [None]:
S[[1, 2, 3]] # positions, counted from 0 (newer pandas versions prefer S.iloc[[1, 2, 3]] for this)

b    1
c    7
d    8
dtype: int64

In [None]:
S[1: 3]

b    1
c    7
dtype: int64

Note above how array-based indexing and slicing also returns the series index.

Series are also like dictionaries, in that we can access values using index labels:

In [None]:
s["A"]

0

In [None]:
s[["B", "D", "C"]]

B    1
D    3
C    2
dtype: int64

In [None]:
# to print slices
S["c":"e"]

c    7
d    8
e    9
dtype: int64

In [None]:
"e" in S

True

In [None]:
"j" in S

False
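Because `[]` accepts both positions and labels on a series, it can be ambiguous. A minimal sketch, assuming the same values as `S` above, of the explicit indexers `.iloc[]` (by position) and `.loc[]` (by label), which are introduced for dataframes later in this chapter:

```python
import pandas as pd

# Same values and labels as the series `S` defined earlier
S = pd.Series(data=[0, 1, 7, 8, 9], index=["a", "b", "c", "d", "e"])

by_position = S.iloc[1:3]    # positions 1 and 2; the stop position is excluded
by_label = S.loc["c":"e"]    # labels "c" through "e"; the stop label is included
picked = S.iloc[[1, 2, 3]]   # a list of positions

print(by_position)
print(by_label)
print(picked)
```

Note the difference between the two slices: position-based slices leave out the stop, label-based slices include it.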

Series do allow for non-unique indexing, but **be careful** because indexing operations won't return unique values:
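As a minimal sketch of that behaviour, assuming a made-up series `dup` (it is not part of the original notebook) in which the label "A" appears twice:

```python
import pandas as pd

# Hypothetical series in which the label "A" appears twice
dup = pd.Series(data=[10, 20, 30], index=["A", "A", "B"])

repeated = dup["A"]   # a Series holding both values labelled "A"
single = dup["B"]     # a single value, because "B" appears only once

print(repeated)
print(single)
print(dup.index.is_unique)   # quick check for duplicate labels
```

Checking `.index.is_unique` before relying on label lookups is one way to avoid being surprised by a series where you expected a single value.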

Finally, we can also do boolean indexing with series:

In [None]:
S[S <= 8]

a    0
b    1
c    7
d    8
dtype: int64

In [None]:
S.mean()

5.0

In [None]:
S[S > S.mean()]

c    7
d    8
e    9
dtype: int64

In [None]:
s[s > s.sum()]

Series([], dtype: int64)

In [None]:
(s == 1)

A    False
B     True
C    False
D    False
E    False
dtype: bool

# Assignment 6: Task 2
Create a series containing the marks of 10 students and then display the values >= 40

In [None]:
marks= pd.Series(data=[30,70,60,39,35,90,45,53,48,29])
#marks.astype(int)
marks[marks >= 40]

1    70
2    60
5    90
6    45
7    53
8    48
dtype: int64

### Series Operations

Unlike ndarrays, operations between Series (+, -, /, \*) align values based on their **LABELS** (not their position in the structure). The resulting index will be the __*sorted union*__ of the two indexes. This gives you the flexibility to run operations on series regardless of their labels.

In [None]:
s1 = pd.Series(data = range(4),
               index = ["A", "B", "C", "D"])
s1

A    0
B    1
C    2
D    3
dtype: int64

In [None]:
s2 = pd.Series(data = range(10, 14),
               index = ["B", "C", "D", "E"])
s2

B    10
C    11
D    12
E    13
dtype: int64

In [None]:
s1 + s2

A     NaN
B    11.0
C    13.0
D    15.0
E     NaN
dtype: float64

As you can see above, indices that match will be operated on. Indices that don't match will still appear in the result, but with `NaN` values.
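If `NaN` is not what you want for unmatched labels, the method form of the operator accepts a `fill_value`. A small sketch, assuming the same `s1` and `s2` as above, that compares the two:

```python
import pandas as pd

s1 = pd.Series(data=range(4), index=["A", "B", "C", "D"])
s2 = pd.Series(data=range(10, 14), index=["B", "C", "D", "E"])

aligned = s1 + s2                   # NaN where a label is missing from one side
filled = s1.add(s2, fill_value=0)   # treat a missing label as 0 instead

print(aligned)
print(filled)
```

Whether 0 is a sensible stand-in depends on the data; for a sum it often is, for a product or a ratio it usually is not.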

We can also perform standard operations on a series, like multiplying or squaring. NumPy also accepts series as an argument to most functions because series are built off numpy arrays (more on that later):

In [None]:
2**3 #2X2X2

8

In [None]:
s1+2

A    2
B    3
C    4
D    5
dtype: int64

In [None]:
s1 ** 2 # power of

A    0
B    1
C    4
D    9
dtype: int64

In [None]:
np.exp(s1)

A     1.000000
B     2.718282
C     7.389056
D    20.085537
dtype: float64

In [None]:
marks.describe()

count    10.000000
mean     49.900000
std      19.220649
min      29.000000
25%      36.000000
50%      46.500000
75%      58.250000
max      90.000000
dtype: float64

In [None]:
s1.describe() # an important method in pandas

count    4.000000
mean     1.500000
std      1.290994
min      0.000000
25%      0.750000
50%      1.500000
75%      2.250000
max      3.000000
dtype: float64

Finally, just like arrays, series have many built-in methods for various operations. You can find them all by running `help(pd.Series)`:

In [None]:
#help(pd.Series)

In [None]:
print([_ for _ in dir(pd.Series) if not _.startswith("_")])  # print all common methods

['T', 'abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'append', 'apply', 'argmax', 'argmin', 'argsort', 'array', 'asfreq', 'asof', 'astype', 'at', 'at_time', 'attrs', 'autocorr', 'axes', 'backfill', 'between', 'between_time', 'bfill', 'bool', 'cat', 'clip', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'copy', 'corr', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'div', 'divide', 'divmod', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'dt', 'dtype', 'dtypes', 'duplicated', 'empty', 'eq', 'equals', 'ewm', 'expanding', 'explode', 'factorize', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'flags', 'floordiv', 'ge', 'get', 'groupby', 'gt', 'hasnans', 'head', 'hist', 'iat', 'idxmax', 'idxmin', 'iloc', 'index', 'infer_objects', 'interpolate', 'is_monotonic', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'isin', 'isna', 'isnull', 'item', 'items', 'iteritems', 'keys', 'ku

In [None]:
s1

A    0
B    1
C    2
D    3
dtype: int64

In [None]:
s1.mean()

1.5

In [None]:
s1.sum()

6

In [None]:
s1.astype(float)

A    0.0
B    1.0
C    2.0
D    3.0
dtype: float64

**"Chaining"** operations together is also common with pandas:

In [None]:
print(s1.add(3.141).pow(2).mean()) # add 3.141, square each value, then take the mean


22.788881


### Data Types



1.   Integer --> 3, 8
2.   Floating Point--> 3.4 , 7.8
3.   String



Series can hold all the data types (`dtypes`) you're used to, e.g., `int`, `float`, `bool`, etc. There are a few other special data types too (`object`, `DateTime` and `Categorical`) which we'll talk about in this and later chapters. You can always read more about pandas dtypes [in the documentation too](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#dtypes). For example, here's a series of `dtype` int64:

In [None]:
import pandas as pd
x = pd.Series(range(5))
x.dtype

dtype('int64')

The dtype "`object`" is used for series of strings or mixed data. Pandas is [currently experimenting](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.StringDtype.html#pandas.StringDtype) with a dedicated string dtype `StringDtype`, but it is still in testing.

In [None]:
x = pd.Series(['Soud', 'Mustapha ', 9, 8.5])
x

0         Soud
1    Mustapha 
2            9
3          8.5
dtype: object

In [None]:
x = pd.Series(['A', 1, ["I", "AM", "A", "LIST"]])
x

0                   A
1                   1
2    [I, AM, A, LIST]
dtype: object

While flexible, it is recommended to avoid the use of `object` dtypes because of higher memory requirements. Essentially, in an `object` dtype series, every single element stores information about its individual dtype. We can inspect the dtypes of all the elements in a mixed series in several ways; below I'll use the `map` method:

In [None]:
x.map(type)

0     <class 'str'>
1     <class 'int'>
2    <class 'list'>
dtype: object

We can see that each object in our series has a different dtype. This comes at a cost. Compare the [memory usage](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.memory_usage.html) of the series below:

In [None]:
x1 = pd.Series([1, 2, 3])
print(f"x1 dtype: {x1.dtype}")
print(f"x1 memory usage: {x1.memory_usage(deep=True)} bytes")
print("")
x2 = pd.Series([1, 2, "3"])
print(f"x2 dtype: {x2.dtype}")
print(f"x2 memory usage: {x2.memory_usage(deep=True)} bytes")
print("")
x3 = pd.Series([1, 2, "3"]).astype('int8')  # coerce the object series to int8
print(f"x3 dtype: {x3.dtype}")
print(f"x3 memory usage: {x3.memory_usage(deep=True)} bytes")

x1 dtype: int64
x1 memory usage: 152 bytes

x2 dtype: object
x2 memory usage: 258 bytes

x3 dtype: int8
x3 memory usage: 131 bytes


In summary, try to use uniform dtypes where possible - they are more memory efficient!

One more gotcha, `NaN` (frequently used to represent missing values in data) is a float:

In [None]:
type(np.nan)  # np.nan is the spelling that works across NumPy versions

float

This can be problematic if you have a series of integers and one missing value, because Pandas will cast the whole series to a float:

In [None]:
pd.Series([1, 2, 3, np.nan])

0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

Only recently, Pandas has implemented a "[nullable integer dtype](https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html)", which can handle `NaN` in an integer series without affecting the `dtype`. Note the capital "I" in the type below, differentiating it from numpy's `int64` dtype:

In [None]:
pd.Series([1, 2, 3, np.nan]).astype('Int64')

0       1
1       2
2       3
3    <NA>
dtype: Int64

This is not the default in Pandas yet and functionality of this new feature is still subject to change.

## 3. Pandas DataFrames
<hr>

### What are DataFrames?

Pandas DataFrames are your new best friend. They are like the Excel spreadsheets you may be used to. DataFrames are really just Series stuck together! Think of a DataFrame as a dictionary of series, with the "keys" being the column labels and the "values" being the series data:

![](img/chapter7/dataframe.png)

### Creating DataFrames

Dataframes can be created using `pd.DataFrame()` (note the capital "D" and "F"). Like series, index and column labels of dataframes are labelled starting from 0 by default:

In [None]:
pd.DataFrame([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

Unnamed: 0,0,1,2
0,1,2,3
1,4,5,6
2,7,8,9


In [None]:
pd.DataFrame([["a", "b", "c"],
              ["a1", "b1", "c1"],
              ["a2", "b2", "c2"]])

Unnamed: 0,0,1,2
0,a,b,c
1,a1,b1,c1
2,a2,b2,c2


We can use the `index` and `columns` arguments to give them labels:

In [None]:
pd.DataFrame([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]],
             index = ["R1", "R2", "R3"],
             columns = ["C1", "C2", "C3"])

Unnamed: 0,C1,C2,C3
R1,1,2,3
R2,4,5,6
R3,7,8,9


There are so many ways to create dataframes. I most often create them from dictionaries or ndarrays:

In [None]:
pd.DataFrame({"C1": [1, 2, 3],
              "C2": ['A', 'B', 'C']},
             index=["R1", "R2", "R3"])

Unnamed: 0,C1,C2
R1,1,A
R2,2,B
R3,3,C


Exercise: create a 3x3 DataFrame (3 rows and 3 columns) using dictionaries.

In [None]:
np.random.randn(4, 4)  # 4x4 array of random values

In [None]:
pd.DataFrame(np.random.randn(5, 5),
             index=[f"Row{_}" for _ in range(1, 6)],
             columns=[f"Col{_}" for _ in range(1, 6)])
# index = ["Row1", "R2", "R3"...]
# columns = ["C1", "C2", ....]

Unnamed: 0,Col1,Col2,Col3,Col4,Col5
Row1,-1.211637,0.073931,0.787369,0.158052,-1.164294
Row2,-0.922295,-0.041971,-0.159052,0.786831,-0.26062
Row3,2.86814,-0.579315,0.775386,-1.413414,0.452498
Row4,-1.598858,-0.03699,-0.83018,2.465415,-1.216389
Row5,-0.541532,0.878605,1.47814,0.567929,-0.539553


In [None]:
a = [1,3,6,8]
a[-1]

8

In [None]:

for i in range(1, 5):
  print(i)
  i = i+1


print("value of i is:", i)

1
2
3
4
value of i is: 5


In [None]:
pd.DataFrame(np.array([['Erum', 7], ['Marziya', 15], ['Urfat', 3]]))

Unnamed: 0,0,1
0,Erum,7
1,Marziya,15
2,Urfat,3



Here's a table of the main ways you can create dataframes (see the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe) for more):

|Create DataFrame from|Code|
|---|---|
|Lists of lists|`pd.DataFrame([['Tom', 7], ['Mike', 15], ['Tiffany', 3]])`|
|ndarray|`pd.DataFrame(np.array([['Tom', 7], ['Mike', 15], ['Tiffany', 3]]))`|
|Dictionary|`pd.DataFrame({"Name": ['Tom', 'Mike', 'Tiffany'], "Number": [7, 15, 3]})`|
|List of tuples|`pd.DataFrame(zip(['Tom', 'Mike', 'Tiffany'], [7, 15, 3]))`|
|Series|`pd.DataFrame({"Name": pd.Series(['Tom', 'Mike', 'Tiffany']), "Number": pd.Series([7, 15, 3])})`|
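The rows of this table are not fully interchangeable. A short sketch, using the same toy values as the table plus column names I've added for readability, suggests that going through an ndarray turns the numbers into strings, because a NumPy array holds a single dtype:

```python
import numpy as np
import pandas as pd

names = ["Tom", "Mike", "Tiffany"]
numbers = [7, 15, 3]

from_dict = pd.DataFrame({"Name": names, "Number": numbers})
from_tuples = pd.DataFrame(list(zip(names, numbers)), columns=["Name", "Number"])
from_array = pd.DataFrame(np.array([["Tom", 7], ["Mike", 15], ["Tiffany", 3]]),
                          columns=["Name", "Number"])

print(from_dict.dtypes)
print(from_array.dtypes)
# The mixed list was converted to an all-string array before pandas saw it
print(type(from_array.loc[0, "Number"]))
```

If the numeric column matters, building from a dictionary or a list of tuples keeps it numeric without an extra conversion step.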


### Indexing and Slicing DataFrames

There are several main ways to select data from a DataFrame:
1. `[]`
2. `.loc[]`
3. `.iloc[]`
4. Boolean indexing
5. `.query()`

In [None]:
df = pd.DataFrame({"Name": ["Erum", "Yusra", "Mukhtar", "Faten"],
                   "Language": ["Python", "Java", "R", "C++"],
                   "Courses": [5, 4, 7, 6]})
df

Unnamed: 0,Name,Language,Courses
0,Erum,Python,5
1,Yusra,Java,4
2,Mukhtar,R,7
3,Faten,C++,6


#### Indexing with `[]`
Select columns by single labels, lists of labels, or slices:

In [None]:
df['Language']  # returns a series

0    Python
1      Java
2         R
3       C++
Name: Language, dtype: object

In [None]:
df[['Name']]  # returns a dataframe!

Unnamed: 0,Name
0,Erum
1,Yusra
2,Mukhtar
3,Faten


In [None]:
df[['Name', 'Language']]

Unnamed: 0,Name,Language
0,Erum,Python
1,Yusra,Java
2,Mukhtar,R
3,Faten,C++


In [None]:
names = ["Erum", "Yusra", "Mukhtar"]  # avoid calling a variable `list`, which hides the built-in
names[1]

'Yusra'

With `[]` you can only select rows by using slices, not single values (this is not recommended though, see the preferred methods below).

In [None]:
df[0] # doesn't work

KeyError: 0

In [None]:
df[0:3] # does work

Unnamed: 0,Name,Language,Courses
0,Erum,Python,5
1,Yusra,Java,4
2,Mukhtar,R,7


In [None]:
df[1:] # does work

Unnamed: 0,Name,Language,Courses
1,Yusra,Java,4
2,Mukhtar,R,7
3,Faten,C++,6


#### Indexing with `.loc` and `.iloc`
Pandas created the methods `.loc[]` and `.iloc[]` as more flexible alternatives for accessing data from a dataframe. Use `df.iloc[]` for indexing with integers and `df.loc[]` for indexing with labels. These are typically the [recommended methods of indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated) in Pandas.

In [None]:
df

Unnamed: 0,Name,Language,Courses
0,Erum,Python,5
1,Yusra,Java,4
2,Mukhtar,R,7
3,Faten,C++,6


First we'll try out `.iloc` which accepts *integers* as references to rows/columns:

In [None]:
df.iloc[1]  # returns a series

Name        Yusra
Language     Java
Courses         4
Name: 1, dtype: object

In [None]:
df.iloc[0:3]  # slicing returns a dataframe

Unnamed: 0,Name,Language,Courses
0,Erum,Python,5
1,Yusra,Java,4
2,Mukhtar,R,7


In [None]:
df.iloc[2, 1]  # returns the indexed object

'R'

In [None]:
test = df.iloc[0, 2]
print(test)
df.iloc[0:1]

5


Unnamed: 0,Name,Language,Courses
0,Erum,Python,5


In [None]:
df.iloc[[0, 1], [0, 2]]  # returns a dataframe

Unnamed: 0,Name,Courses
0,Erum,5
1,Yusra,4


Now let's look at `.loc` which accepts *labels* as references to rows/columns:

In [None]:
df.loc[:, 'Name']

0       Erum
1      Yusra
2    Mukhtar
3      Faten
Name: Name, dtype: object

In [None]:
df.loc[:, 'Name':'Language']

Unnamed: 0,Name,Language
0,Erum,Python
1,Yusra,Java
2,Mukhtar,R
3,Faten,C++


In [None]:
df.loc[[1, 2], ['Language']]

Unnamed: 0,Language
1,Java
2,R


Sometimes we want to use a mix of integers and labels to reference data in a dataframe. The easiest way to do this is to use `.loc[]` with a label, then use an integer in combination with `.index` or `.columns`:

In [None]:
df.index

RangeIndex(start=0, stop=4, step=1)

In [None]:
df.columns

Index(['Name', 'Language', 'Courses'], dtype='object')

In [None]:
df.loc[df.index[2], 'Courses']  # I want to reference the third row (position 2) and the column named "Courses"

7

In [None]:
df.loc[2, df.columns[1]]  # I want to reference row "2" and the second column

'R'

#### Boolean indexing
Just like with series, we can select data based on boolean masks:

In [None]:
df

Unnamed: 0,Name,Language,Courses
0,Erum,Python,5
1,Yusra,Java,4
2,Mukhtar,R,7
3,Faten,C++,6


In [None]:
df[df['Courses'] != 5]

Unnamed: 0,Name,Language,Courses
1,Yusra,Java,4
2,Mukhtar,R,7
3,Faten,C++,6


In [None]:
df[df['Name'] == "Erum"]

Unnamed: 0,Name,Language,Courses
0,Erum,Python,5


#### Indexing with `.query()`
Boolean masks work fine, but I prefer to use the `.query()` method for selecting data. `df.query()` is a powerful tool for filtering data. Its syntax is odd, one of the strangest I've seen in Python, and reads more like SQL: `df.query()` accepts a string expression to evaluate and it "knows" the names of the columns in your dataframe.

In [None]:
df.query("Courses > 4 & Language != 'Python'")

Unnamed: 0,Name,Language,Courses
2,Mukhtar,R,7
3,Faten,C++,6


Note the use of single quotes AND double quotes above, lucky we have both in Python! Compare this to the equivalent boolean indexing operation and you can see that `.query()` is much more readable, especially as the query gets bigger!

In [None]:
df[(df['Courses'] > 4) & (df['Language'] != 'Python')]

Unnamed: 0,Name,Language,Courses
2,Mukhtar,R,7
3,Faten,C++,6


Query also allows you to reference variables in the current workspace using the `@` symbol:

In [None]:
print(10)

x = 10
print(x)

10
10


In [None]:
course_threshold = 4
df.query("Courses > @course_threshold")

Unnamed: 0,Name,Language,Courses
0,Erum,Python,5
2,Mukhtar,R,7
3,Faten,C++,6


#### Indexing cheatsheet

|Method|Syntax|Output|
|---|---|---|
|Select column|`df[col_label]`|Series|
|Select row slice|`df[row_1_int:row_2_int]`|DataFrame|
|Select row/column by label|`df.loc[row_label(s), col_label(s)]`|Object for single selection, Series for one row/column, otherwise DataFrame|
|Select row/column by integer|`df.iloc[row_int(s), col_int(s)]`|Object for single selection, Series for one row/column, otherwise DataFrame|
|Select by row integer & column label|`df.loc[df.index[row_int], col_label]`|Object for single selection, Series for one row/column, otherwise DataFrame|
|Select by row label & column integer|`df.loc[row_label, df.columns[col_int]]`|Object for single selection, Series for one row/column, otherwise DataFrame|
|Select by boolean|`df[bool_vec]`|DataFrame|
|Select by boolean expression|`df.query("expression")`|DataFrame|
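To see the cheatsheet in action, here is a small sketch that rebuilds the four-row `df` from earlier in this section and runs one selection per row of the table, printing the type of what comes back:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Erum", "Yusra", "Mukhtar", "Faten"],
                   "Language": ["Python", "Java", "R", "C++"],
                   "Courses": [5, 4, 7, 6]})

column = df["Language"]                    # Series
row_slice = df[0:2]                        # DataFrame
by_label = df.loc[1, "Name"]               # single value
by_position = df.iloc[2, 1]                # single value
mixed = df.loc[df.index[3], "Courses"]     # row position 3, column label
masked = df[df["Courses"] > 4]             # DataFrame
queried = df.query("Courses > 4")          # DataFrame

for result in (column, row_slice, masked, queried):
    print(type(result).__name__)
print(by_label, by_position, mixed)
```

The boolean mask and the `.query()` call select the same rows here, so which one to use is mostly a matter of readability.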

# Review


*   Created DataFrames
*   Accessed values using indexing, `.loc` and `.iloc`



### Reading/Writing Data From External Sources

#### .csv files

A lot of the time you will be loading .csv files for use in pandas. You can use the `pd.read_csv()` function for this. In the following chapters we'll use a real dataset of my cycling commutes to the University of British Columbia. There are so many arguments that can be used to help read in your .csv file in an efficient and appropriate manner, feel free to check them out now (by using `shift + tab` in Jupyter, or typing `help(pd.read_csv)`).

In [None]:
from google.colab import files
uploaded = files.upload()

Saving Titanic_dataset_training.csv to Titanic_dataset_training.csv


In [None]:
# Original example path: 'data/cycling_data.csv'. Here we load the uploaded Titanic file instead.
df = pd.read_csv("Titanic_dataset_training.csv", index_col=0, parse_dates=True)
df

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [None]:
df.columns
df.describe()
df["Survived"]

PassengerId
1      0
2      1
3      1
4      1
5      0
      ..
887    0
888    1
889    0
890    1
891    0
Name: Survived, Length: 891, dtype: int64

You can write a dataframe to .csv using `df.to_csv()`. Be sure to check out all of the possible arguments to write your dataframe exactly how you want it.
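A minimal sketch of a write-then-read round trip, assuming a tiny made-up dataframe `df_small` and an in-memory buffer in place of a real file so that nothing is written to disk:

```python
import io
import pandas as pd

df_small = pd.DataFrame({"Name": ["Erum", "Yusra"], "Courses": [5, 4]})

buffer = io.StringIO()                 # stands in for a file on disk
df_small.to_csv(buffer, index=False)   # index=False leaves the row labels out of the file
csv_text = buffer.getvalue()
print(csv_text)

buffer.seek(0)                         # rewind before reading the text back
round_trip = pd.read_csv(buffer)
print(round_trip.equals(df_small))
```

With a real file you would pass a path such as `"my_data.csv"` (a hypothetical name) instead of the buffer. Leaving out `index=False` writes the row labels as an extra first column, which is a common source of `Unnamed: 0` columns when the file is read again.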

#### url

Pandas also facilitates reading directly from a url - `pd.read_csv()` accepts urls as input:

In [None]:
url = 'https://raw.githubusercontent.com/TomasBeuzen/toy-datasets/master/wine_1.csv'
df_1 = pd.read_csv(url)


In [None]:
df_1.head(10)
df_1.shape # number of rows and columns in the dataframe

(5, 7)

#### Other
Pandas can read data from all sorts of other file types including HTML, JSON, Excel, Parquet, Feather, etc. There are generally dedicated functions for reading these file types, see the [Pandas documentation here](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-tools-text-csv-hdf5).
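As one illustration, here is a hedged sketch of reading JSON with `pd.read_json()`. The records are made up and inlined as a string so the example needs no external file:

```python
import io
import pandas as pd

# Hypothetical JSON records, one object per row
json_text = '[{"Name": "Erum", "Courses": 5}, {"Name": "Yusra", "Courses": 4}]'

df_json = pd.read_json(io.StringIO(json_text))
print(df_json)
print(df_json.dtypes)
```

The other readers follow the same pattern of a `pd.read_*()` function that returns a dataframe, although some of them need extra packages to be installed.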

### Common DataFrame Operations

DataFrames have built-in functions for performing most common operations, e.g., `.min()`, `idxmin()`, `sort_values()`, etc. They're all documented in the [Pandas documentation here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) but I'll demonstrate a few below:

In [None]:
df = pd.read_csv('Titanic_dataset_training.csv')
print(df.head(10))
print(df.shape)

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   
5            6         0       3   
6            7         0       1   
7            8         0       3   
8            9         1       3   
9           10         1       2   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   
5                                   Moran, Mr. James    male   NaN      0   
6                            McCarthy, Mr. Timothy J    male  54

In [None]:
df.min()


PassengerId                      1
Survived                         0
Pclass                           1
Name           Abbing, Mr. Anthony
Sex                         female
Age                           0.42
SibSp                            0
Parch                            0
Ticket                      110152
Fare                           0.0
dtype: object

In [None]:
df['Age'].min()

0.42

In [None]:
df['Age'].idxmin()

803

In [None]:
df.loc[803]  # idxmin() returns an index label, so .loc is the matching accessor

PassengerId                                804
Survived                                     1
Pclass                                       3
Name           Thomas, Master. Assad Alexander
Sex                                       male
Age                                       0.42
SibSp                                        0
Parch                                        1
Ticket                                    2625
Fare                                    8.5167
Cabin                                      NaN
Embarked                                     C
Name: 803, dtype: object
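Note that `.idxmin()` returns an index label, not an integer position. With the default integer index the two coincide, as in the lookup above, but they differ once the index is customised. A minimal sketch on a made-up frame (the name `toy` and its values are hypothetical):

```python
import pandas as pd

# Small made-up frame with non-default index labels
toy = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "age": [31.0, 4.5, 58.0]},
    index=[10, 20, 30],
)

youngest_label = toy["age"].idxmin()    # index label of the smallest age
youngest_row = toy.loc[youngest_label]  # look the row up by that label
print(youngest_label)
print(youngest_row["name"])
```

Here `toy.iloc[youngest_label]` would fail, because there is no row at integer position 20.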

In [None]:
df.sum()

PassengerId                                               397386
Survived                                                     342
Pclass                                                      2057
Name           Braund, Mr. Owen HarrisCumings, Mrs. John Brad...
Sex            malefemalefemalefemalemalemalemalemalefemalefe...
Age                                                     21205.17
SibSp                                                        466
Parch                                                        340
Ticket         A/5 21171PC 17599STON/O2. 31012821138033734503...
Fare                                                  28693.9493
dtype: object

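Some methods, like `.mean()`, only make sense for numeric columns. In pandas 2.0 and later you need to pass `numeric_only=True` to skip the non-numeric columns (older versions dropped them automatically):

In [None]:
df.mean(numeric_only=True)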

PassengerId    446.000000
Survived         0.383838
Pclass           2.308642
Age             29.699118
SibSp            0.523008
Parch            0.381594
Fare            32.204208
dtype: float64
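Here is a minimal sketch of restricting a reduction to numeric columns, using a made-up frame (`toy` and its values are hypothetical) so that it runs without the Titanic file:

```python
import pandas as pd

# Made-up frame mixing a text column with two numeric columns
toy = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "age": [30.0, 40.0, 50.0], "fare": [7.5, 8.0, 8.5]}
)

# numeric_only=True skips the text column instead of trying to average it
means = toy.mean(numeric_only=True)
print(means)
```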

Some methods require arguments to be specified, like `.sort_values()`:

In [None]:
df.sort_values(by='Age')

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 803 | 804 | 1 | 3 | Thomas, Master. Assad Alexander | male | 0.42 | 0 | 1 | 2625 | 8.5167 | NaN | C |
| 755 | 756 | 1 | 2 | Hamalainen, Master. Viljo | male | 0.67 | 1 | 1 | 250649 | 14.5000 | NaN | S |
| 644 | 645 | 1 | 3 | Baclini, Miss. Eugenie | female | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | C |
| 469 | 470 | 1 | 3 | Baclini, Miss. Helene Barbara | female | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | C |
| 78 | 79 | 1 | 2 | Caldwell, Master. Alden Gates | male | 0.83 | 0 | 2 | 248738 | 29.0000 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 859 | 860 | 0 | 3 | Razi, Mr. Raihed | male | NaN | 0 | 0 | 2629 | 7.2292 | NaN | C |
| 863 | 864 | 0 | 3 | Sage, Miss. Dorothy Edith "Dolly" | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
| 868 | 869 | 0 | 3 | van Melkebeke, Mr. Philemon | male | NaN | 0 | 0 | 345777 | 9.5000 | NaN | S |
| 878 | 879 | 0 | 3 | Laleff, Mr. Kristo | male | NaN | 0 | 0 | 349217 | 7.8958 | NaN | S |


In [None]:
df.sort_values(by='Age', ascending=False)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 630 | 631 | 1 | 1 | Barkworth, Mr. Algernon Henry Wilson | male | 80.0 | 0 | 0 | 27042 | 30.0000 | A23 | S |
| 851 | 852 | 0 | 3 | Svensson, Mr. Johan | male | 74.0 | 0 | 0 | 347060 | 7.7750 | NaN | S |
| 493 | 494 | 0 | 1 | Artagaveytia, Mr. Ramon | male | 71.0 | 0 | 0 | PC 17609 | 49.5042 | NaN | C |
| 96 | 97 | 0 | 1 | Goldschmidt, Mr. George B | male | 71.0 | 0 | 0 | PC 17754 | 34.6542 | A5 | C |
| 116 | 117 | 0 | 3 | Connors, Mr. Patrick | male | 70.5 | 0 | 0 | 370369 | 7.7500 | NaN | Q |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 859 | 860 | 0 | 3 | Razi, Mr. Raihed | male | NaN | 0 | 0 | 2629 | 7.2292 | NaN | C |
| 863 | 864 | 0 | 3 | Sage, Miss. Dorothy Edith "Dolly" | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
| 868 | 869 | 0 | 3 | van Melkebeke, Mr. Philemon | male | NaN | 0 | 0 | 345777 | 9.5000 | NaN | S |
| 878 | 879 | 0 | 3 | Laleff, Mr. Kristo | male | NaN | 0 | 0 | 349217 | 7.8958 | NaN | S |
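In both outputs above, the rows with a missing `Age` end up at the bottom, whether the sort is ascending or descending. The sketch below reproduces this on a made-up frame (`toy` is hypothetical) and shows the `na_position` argument of `.sort_values()`, which moves missing values to the top instead:

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing age
toy = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy", "Di"], "age": [30.0, np.nan, 4.0, 58.0]}
)

by_age = toy.sort_values(by="age")                           # NaN last by default
by_age_desc = toy.sort_values(by="age", ascending=False)     # NaN is still last
nan_first = toy.sort_values(by="age", na_position="first")   # NaN moved to the top

print(by_age["name"].tolist())
print(by_age_desc["name"].tolist())
print(nan_first["name"].tolist())
```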


Some methods will operate on the index/columns, like `.sort_index()`:

In [None]:
df.sort_index(ascending=False)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
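`.sort_index()` sorts by the row labels by default, and passing `axis=1` sorts by the column labels instead. A small sketch on a made-up frame (`toy` and its labels are hypothetical):

```python
import pandas as pd

# Made-up frame whose row and column labels are both out of order
toy = pd.DataFrame({"b": [1, 2], "a": [3, 4]}, index=["y", "x"])

rows_sorted = toy.sort_index()        # sort by row labels: x, y
cols_sorted = toy.sort_index(axis=1)  # sort by column labels: a, b

print(rows_sorted)
print(cols_sorted)
```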


## 4. Why ndarrays and Series and DataFrames?

At this point, you might be asking why we need all these different data structures. Well, they all serve different purposes and are suited to different tasks. For example:
- NumPy is typically faster/uses less memory than Pandas;
- not all Python packages are compatible with NumPy & Pandas;
- the ability to add labels to data can be useful (e.g., for time series);
- NumPy and Pandas have different built-in functions available.
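To get a feel for the memory point in the list above, the sketch below compares the bytes held by an ndarray with the bytes reported for a Series wrapping the same data. The names and sizes are made up, and the exact Series figure depends on your pandas version and platform:

```python
import numpy as np
import pandas as pd

arr = np.arange(1000, dtype="int64")
s = pd.Series(arr)

array_bytes = arr.nbytes                   # raw data only: 1000 values * 8 bytes
series_bytes = s.memory_usage(index=True)  # data plus the index

print(array_bytes, series_bytes)
```

The Series needs at least as much memory as the array because it also stores an index.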

My advice: use the simplest data structure that fulfills your needs!

Finally, we've seen how to go from: ndarray (`np.array()`) -> series (`pd.Series()`) -> dataframe (`pd.DataFrame()`). Remember that we can also go the other way: dataframe/series -> ndarray using `df.to_numpy()`.
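The round trip can be sketched as follows (the variable names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

arr = np.array([1.5, 2.5, 3.5])  # ndarray
s = pd.Series(arr, name="x")     # ndarray -> Series
df_rt = pd.DataFrame(s)          # Series -> one-column DataFrame

back_1d = s.to_numpy()           # Series -> 1-D ndarray
back_2d = df_rt.to_numpy()       # DataFrame -> 2-D ndarray

print(back_1d.shape, back_2d.shape)
```

Note that a DataFrame always converts to a 2-D array, even when it has a single column.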

# Assignment (optional)
Find a dataset in a CSV file and apply all the functions we used on the Titanic dataset. This dataset will be graded as your project.
We will apply Pandas, EDA, visualization and ML algorithms to it.


1. Research question (what you want to extract from the data), e.g., in the Titanic dataset we want to predict the survival rate.
2. We will use the same dataset for the next 5-6 sessions (this will be a mini project).
3. The dataset can be found on Kaggle or any website of your choice.

