Python is well suited to data analysis, largely because of its ecosystem of data-centric packages. Pandas is one of those packages; it makes importing and analyzing data much easier.

In [3]:
import pandas as pd

In [4]:
# Create data - dictionary of column name -> list of values
data = {
    "Name": ['A', 'B','C'],
    "Age":[23, 25,30],
    "Dept": ['Marketing','Finance','Operations']
}

# Convert to a DataFrame
df = pd.DataFrame(data)
print(df)

  Name  Age        Dept
0    A   23   Marketing
1    B   25     Finance
2    C   30  Operations


In [5]:
# Second example: the Name column contains a missing value (None)
data = {
    "Name": ['A','B',None],
    "Age":[25, 30, 32],
    "Dept":['Sales','MIS','Accounts']
}

# Convert to a DataFrame
df = pd.DataFrame(data)
print(df.info())               # info() prints a summary (dtypes, non-null counts) and returns None, so print() adds a trailing "None"

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    2 non-null      object
 1   Age     3 non-null      int64 
 2   Dept    3 non-null      object
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes
None
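
As a small complement to `info()`, the sketch below counts missing values per column directly. It rebuilds the same three-row data; the names `example` and `missing_per_column` are illustrative only.

```python
import pandas as pd

# Same sample data as above: one missing value in "Name"
example = pd.DataFrame({
    "Name": ['A', 'B', None],
    "Age": [25, 30, 32],
    "Dept": ['Sales', 'MIS', 'Accounts'],
})

# isna() marks missing cells as True; sum() counts them per column
missing_per_column = example.isna().sum()
print(missing_per_column)

# count() reports non-null entries, matching the "Non-Null Count" column of info()
non_null_per_column = example.count()
print(non_null_per_column)
```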


### Indexing & Selection

In [6]:
# column names
df.columns

Index(['Name', 'Age', 'Dept'], dtype='object')

In [7]:
print(df)

   Name  Age      Dept
0     A   25     Sales
1     B   30       MIS
2  None   32  Accounts


In [8]:
# Row labels are stored in the index
print(df.index)

RangeIndex(start=0, stop=3, step=1)


**The index shows: start, stop, step** (`stop` is exclusive, so the row labels here are 0, 1, 2)
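
To make the three parts concrete, here is a minimal sketch that reads them back from a default index. The frame `frame` and variable `idx` are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"Age": [25, 30, 32]})

# A default RangeIndex exposes its three parameters as attributes
idx = frame.index
print(idx.start, idx.stop, idx.step)   # stop is exclusive, like Python's range()

# The row labels themselves
print(list(idx))
```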

#### How to Create Pandas DataFrame with Random Data

In [9]:
# Import required package
import pandas as pd
import numpy as np

# Create a 10x3 DataFrame of random integers from 0 to 99 (the upper bound 100 is excluded)
df = pd.DataFrame(np.random.randint(0, 100, size=(10, 3)), columns=list('ABC'))

#Display
print(df)

    A   B   C
0  49  83   8
1  11  71   4
2   2  33  23
3  96  49  51
4  59  94  95
5  59  74   6
6  76  86  12
7  95  81  38
8  13  64   0
9  29  90  95
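
Because the data is random, the numbers above change on every run. If repeatable output is wanted, one option is a seeded NumPy generator, sketched below. The seed value 42 and the names `seeded` and `seeded_again` are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

# A seeded generator returns the same numbers on every run (the seed 42 is arbitrary)
rng = np.random.default_rng(42)
seeded = pd.DataFrame(rng.integers(0, 100, size=(10, 3)), columns=list('ABC'))

# Re-creating the generator with the same seed reproduces the frame exactly
rng_again = np.random.default_rng(42)
seeded_again = pd.DataFrame(rng_again.integers(0, 100, size=(10, 3)), columns=list('ABC'))

print(seeded.equals(seeded_again))
```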


In [10]:
# 'loc' selects rows by index label. With the default index the labels are 0..9,
# so this fetches the row labelled 4
print(df.loc[4])

A    59
B    94
C    95
Name: 4, dtype: int64


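- `loc` and `iloc` are two <b>indexers</b> (accessor attributes, not functions) used to select and slice rows and columns of a Pandas DataFrame: `loc` works with labels, `iloc` with integer positions.
- Both take their row/column selection in square brackets, e.g. `df.loc[5]`, not in parentheses.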
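
With the default 0, 1, 2, ... index, labels and positions coincide, so the two indexers look identical. The sketch below uses a hypothetical frame with string labels (`labelled`, with labels `'x'`, `'y'`, `'z'`) to show where they differ.

```python
import pandas as pd

# Hypothetical frame with string row labels so labels and positions differ
labelled = pd.DataFrame(
    {"Age": [23, 25, 30]},
    index=['x', 'y', 'z'],
)

by_label = labelled.loc['y']      # row whose label is 'y'
by_position = labelled.iloc[1]    # second row, whatever its label is

print(by_label['Age'], by_position['Age'])
```

Both lookups return the same row here, but only `loc` needs to know the label and only `iloc` needs to know the position.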

In [11]:
df = pd.DataFrame(np.random.randint(0,200, size=(10,5)), columns=list('ABCDE'))
print(df)

     A    B    C    D    E
0   14   28   65  111  151
1  175  155   59  164  161
2  138   71   17  192   70
3  175  129  113  191  112
4    7   15  193   74   54
5   97   67    5   17  115
6  189   36  127   73   86
7   48  160   74   85   45
8  100   84   73   48   82
9   67    4  148  131  192


In [12]:
df.loc[5]

A     97
B     67
C      5
D     17
E    115
Name: 5, dtype: int64

## iloc

The `.iloc` indexer in pandas is used for **integer-location-based indexing**. It selects specific rows and columns by their integer positions (counting from 0), regardless of their labels.
- That means you fetch by row or column position number (an integer), not by row or column label.
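
One practical consequence of position-based selection is how slices behave. The following sketch (the frame `numbers` is invented for illustration) compares an `iloc` slice with the equivalent-looking `loc` slice and shows a negative position.

```python
import pandas as pd

numbers = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})

# Position-based slice: the stop position is excluded, like Python lists
by_position = numbers.iloc[0:2]

# Label-based slice: the stop label is included
by_label = numbers.loc[0:2]

# Negative positions count from the end
last_row = numbers.iloc[-1]

print(len(by_position), len(by_label), last_row['A'])
```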

In [13]:
# Exercise 1:
# Sample dataset

data = {
    'A': [1, 2, 3, 4, 5], 
    'B': [6, 7, 8, 9, 10],
    'C': [11, 12, 13, 14, 15]
}

df = pd.DataFrame(data)
print(df)

# select the first row by position
first_row = df.iloc[0]  # expected values: [1, 6, 11]
print(first_row)

   A   B   C
0  1   6  11
1  2   7  12
2  3   8  13
3  4   9  14
4  5  10  15
A     1
B     6
C    11
Name: 0, dtype: int64


In [14]:
# select the 3rd row, i.e. position 2       expected values: [3, 8, 13]
third_row = df.iloc[2]
print(third_row)

A     3
B     8
C    13
Name: 2, dtype: int64


#### Selecting Multiple Rows & columns

In [15]:
print(df)

   A   B   C
0  1   6  11
1  2   7  12
2  3   8  13
3  4   9  14
4  5  10  15


In [16]:
# Select the single cell at the first row and the first column
specific_cell = df.iloc[0, 0]
print('First Row & First Column is :', specific_cell)


First Row & First Column is : 1


In [17]:
# Select the first two rows and the first two columns
subset = df.iloc[0:2, 0:2]   # rows 0:2 and columns 0:2 (positions 0 and 1; the stop is excluded)
print(subset)

   A  B
0  1  6
1  2  7


### Using a List of Integers

In [18]:
print(df)

   A   B   C
0  1   6  11
1  2   7  12
2  3   8  13
3  4   9  14
4  5  10  15


In [19]:
# Select specific rows
specific_rows = df.iloc[[0, 2]]    # rows at positions 0 and 2
print(specific_rows)


   A  B   C
0  1  6  11
2  3  8  13


### Using Boolean Indexing:

In [20]:
# Select rows where the index is even
even_rows = df.iloc[lambda x: x.index % 2 == 0]
print(even_rows)

   A   B   C
0  1   6  11
2  3   8  13
4  5  10  15


- The mask tests the index labels (0, 2, 4 are even), not the cell values.
- Boolean indexing means passing one True/False per row; rows marked True are kept. Here the lambda builds that mask from the index.
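
For contrast, a mask can also be built from the values of a column. The sketch below assumes a small illustrative frame (`frame`) and filters on column `A`; note that, in the pandas versions I am aware of, `.iloc` rejects a boolean Series, so the mask is converted to a NumPy array first.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})

# Mask built from the values of column A (not from the index)
mask = frame['A'] > 2

# Plain [] accepts a boolean Series directly
filtered = frame[mask]

# For .iloc, pass the underlying boolean array instead of the Series
filtered_iloc = frame.iloc[mask.to_numpy()]

print(filtered)
```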

In [21]:
mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
           {'a': 100, 'b': 200, 'c': 300, 'd': 400},
           {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}]
df = pd.DataFrame(mydict)
print(df)
print('\nWith a Scalar Integer :\n', df.iloc[0])   # \n starts a new line

      a     b     c     d
0     1     2     3     4
1   100   200   300   400
2  1000  2000  3000  4000

With a Scalar Integer :
 a    1
b    2
c    3
d    4
Name: 0, dtype: int64


In [22]:
df.iloc[[0, 2], [1, 3]]  # rows at positions 0 and 2; columns at positions 1 and 3

      b     d
0     2     4
2  2000  4000


In [23]:
df.iloc[1:3, 0:3] # rows at positions 1 and 2; columns at positions 0 to 2 (the stop value 3 is excluded)

      a     b     c
1   100   200   300
2  1000  2000  3000


In [24]:
df.iloc[:, [True, False, True, False]]  # all rows; boolean mask over columns keeps a and c

      a     c
0     1     3
1   100   300
2  1000  3000


### Selecting Column

In [25]:
# Create data - dictionary of column name -> list of values
data = {
    "Name": ['A', 'B','C'],
    "Age":[23, 25,30],
    "Dept": ['Marketing','Finance','Operations']
}

# Convert to a DataFrame
df = pd.DataFrame(data)
print(df)

  Name  Age        Dept
0    A   23   Marketing
1    B   25     Finance
2    C   30  Operations


In [26]:
# Select the single column 'Age' (column names are case sensitive); the result is a 1D Series
df['Age']

0    23
1    25
2    30
Name: Age, dtype: int64

In [27]:
# Select multiple columns by passing a list of names; the result is a 2D DataFrame
df[['Name','Age']]

  Name  Age
0    A   23
1    B   25
2    C   30
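
The difference between single and double brackets shows up in the type of the result. A short sketch, using an illustrative frame named `people`:

```python
import pandas as pd

people = pd.DataFrame({"Name": ['A', 'B', 'C'], "Age": [23, 25, 30]})

one_column = people['Age']       # single brackets -> Series (1D)
as_frame = people[['Age']]       # list of names -> DataFrame (2D), even with one column

print(type(one_column).__name__, one_column.shape)
print(type(as_frame).__name__, as_frame.shape)
```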


### Adding & Removing Columns

In [28]:
print(df)

  Name  Age        Dept
0    A   23   Marketing
1    B   25     Finance
2    C   30  Operations


In [29]:
## Add Salary to df
df['Salary'] = [10000,15000,18000]
print(df)

  Name  Age        Dept  Salary
0    A   23   Marketing   10000
1    B   25     Finance   15000
2    C   30  Operations   18000


In [31]:
# Removing a column: drop() returns a new DataFrame and leaves df unchanged
rem = df.drop('Salary', axis=1)
print(rem)

  Name  Age        Dept
0    A   23   Marketing
1    B   25     Finance
2    C   30  Operations
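
The point that `drop` returns a new DataFrame is easy to miss, so here is a small sketch that checks it. The frame `staff` is a stand-in for the data above, and `columns=` is used as an alternative spelling of `axis=1`.

```python
import pandas as pd

staff = pd.DataFrame({
    "Name": ['A', 'B', 'C'],
    "Age": [23, 25, 30],
    "Salary": [10000, 15000, 18000],
})

# drop() returns a new DataFrame; `staff` itself keeps the column
without_salary = staff.drop(columns='Salary')

# Several columns can be dropped at once by passing a list
names_only = staff.drop(columns=['Age', 'Salary'])

print(list(staff.columns), list(without_salary.columns), list(names_only.columns))
```

To actually discard the column, assign the result back, for example `staff = staff.drop(columns='Salary')`.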


# Notes

    Q. What is the purpose of the index?
    A. The index holds the row labels. It identifies each row and is what `loc` uses to look rows up.
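
As a rough illustration of that answer, the sketch below replaces the default numeric index with the `Name` column, so rows can be fetched by name. The frame `staff` mirrors the sample data used earlier and is otherwise hypothetical.

```python
import pandas as pd

staff = pd.DataFrame({
    "Name": ['A', 'B', 'C'],
    "Age": [23, 25, 30],
    "Dept": ['Marketing', 'Finance', 'Operations'],
})

# Use the Name column as the row labels instead of 0, 1, 2
by_name = staff.set_index('Name')

# Rows can now be looked up by name with loc
row_b = by_name.loc['B']
print(row_b['Age'], row_b['Dept'])
```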