## **EDA (Dataframe Operations)**

### Packages Used for Different Processes and Frameworks

**EDA**

- Pandas     : Dataframe or data applications
- Numpy      : Numerical Python for math operations
- Matplotlib : For plots, graphs
- Seaborn    : For plots, graphs
- Bokeh      : For plots
- Plotly     : For plots


**Machine Learning**

- Scikit-learn (sklearn)  : ML models
- Stats                   : Statistical models


**Data retrieval from SQL / web scraping**

- sqlite           : connect to a database
- Beautiful Soup   : scrape data from websites
- websocket        : scrape

**Deep Learning/Computer Vision**

- Tensorflow       : developed by Google
- Keras            : DL models
- OpenCV           : Image operations (Computer vision)
- Pillow (PIL)     : Image operations
- PyTorch          : developed by Facebook (Meta)
- Pretrained models :
    - VGG16
    - Mobilenet
    - Resnet
    - rcnn
    - maskrcnn
    - fastercnn
    - yolo

**Natural language processing**

- NLTK     : natural language processing
- spaCy    :
- word2vec : converts words into vectors
- GloVe
- wordcloud


**Hugging face**

- Transformers : BERT


**GenAI models**

- OpenAI + Microsoft : GPT models
- Google : Gemini
- Amazon : Q
- Meta (Facebook) : Meta AI
- Apple :
- Other well-known models :
    - Llama
    - Gemma
    - DALL-E
    - T5

- Different companies provide their own packages

**Langchain**

- LangChain is a large framework for building applications on top of GenAI models
- Install it with `pip install langchain`; it provides a common interface to models from many providers


**Deployment/Applications**

- Flask
- Streamlit
- Django


**MLOps**

- MLflow
- Kubeflow
- Azure AI services : Azure packages
- GCP : GCP packages

**Step 1**

- Create a dataframe
    - We will create a dataframe using a list
    - We will create a dataframe using a dictionary

In [1]:
import pandas as pd

In [4]:
d1 = pd.DataFrame()   # empty dataframe
d1

In [5]:
print(type(d1))  # This will give the datatype

<class 'pandas.core.frame.DataFrame'>


**Step 2**

- Provide the data using a list

In [6]:
names = ['tarun','surender','mohith']
pd.DataFrame(names)

          0
0     tarun
1  surender
2    mohith


In [7]:
names = ['tarun','surender','mohith']
age = [20,25,30]
pd.DataFrame(zip(names,age))   # zip combines the lists when we want more than 1 column

          0   1
0     tarun  20
1  surender  25
2    mohith  30


In [8]:
names = ['tarun','surender','mohith']
age = [20,25,30]
city = ['blr','hyd','pune']
pd.DataFrame(zip(names,age,city))

          0   1     2
0     tarun  20   blr
1  surender  25   hyd
2    mohith  30  pune


**Step 3**

- Provide the column names with the columns argument
- When there are multiple names, always put them in a list

In [10]:
pd.DataFrame(zip(names,age,city),columns=['Name','Age','City'])

       Name  Age  City
0     tarun   20   blr
1  surender   25   hyd
2    mohith   30  pune


**Step 4**

- Providing index

In [11]:
pd.DataFrame(zip(names,age,city),index=['a','b','c'],columns=['Name','Age','City'])

       Name  Age  City
a     tarun   20   blr
b  surender   25   hyd
c    mohith   30  pune


In [12]:
pd.DataFrame(zip(names,age),index=city,columns=['Name','Age'])  # one of the lists (here city) can also serve as the index

          Name  Age
blr      tarun   20
hyd   surender   25
pune    mohith   30
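
Step 1 also mentions building a dataframe from a dictionary, which the cells above do not show. A minimal sketch, reusing the same illustrative names and ages (the variable names `data` and `df_from_dict` are assumptions made for this example):

```python
import pandas as pd

# Each dictionary key becomes a column name;
# each list holds the values of that column.
data = {
    'Name': ['tarun', 'surender', 'mohith'],
    'Age': [20, 25, 30],
}
df_from_dict = pd.DataFrame(data)

print(df_from_dict.columns.tolist())  # ['Name', 'Age']
print(df_from_dict.shape)             # (3, 2)
```

With a dictionary the column names travel with the data, so a separate `columns` argument is not needed.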


- if we need many rows (say 100), we can generate the lists with a list comprehension
- even with only 1 column, it is good practice to pass the values as a list
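
As a rough illustration of the two points above, the sketch below generates 100 rows with list comprehensions and also builds a single-column dataframe from one list. The column names `Id` and `Label` are made up for this example.

```python
import pandas as pd

# Generate 100 values instead of typing them by hand
ids = [i for i in range(1, 101)]
labels = [f'row_{i}' for i in ids]

# Two columns: combine the lists with zip
df_big = pd.DataFrame(zip(ids, labels), columns=['Id', 'Label'])

# One column: pass the list directly, with the name in a list
df_single = pd.DataFrame(ids, columns=['Id'])

print(df_big.shape)     # (100, 2)
print(df_single.shape)  # (100, 1)
```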

In [14]:
df = pd.DataFrame(zip(names,age,city),columns=['Name','Age','City'])
df.describe()  # describe() returns descriptive statistics for the numerical columns

        Age
count   3.0
mean   25.0
std     5.0
min    20.0
25%    22.5
50%    25.0
75%    27.5
max    30.0
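
One detail worth knowing about the output above: the `std` row is the sample standard deviation (divisor n - 1), not the population one. A small sketch with the same three ages, using a standalone Series named `ages` only for this example:

```python
import pandas as pd

ages = pd.Series([20, 25, 30], name='Age')
summary = ages.describe()

# describe() reports the sample standard deviation (ddof=1)
sample_std = summary['std']

# Population standard deviation (ddof=0), for comparison
population_std = ages.std(ddof=0)

print(sample_std)
print(population_std)
```

For these values the sample figure is 5.0, matching the table, while the population figure is smaller.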


**Step 5**

- Add a new column to an existing dataframe

In [6]:
import pandas as pd
names = ['Rahul','Rohan','Rajesh']
age = [20,25,30]
city = ['hyd','blr','pune']
df = pd.DataFrame(zip(names,age,city),columns=['Name','Age','City'])
print(df)

     Name  Age  City
0   Rahul   20   hyd
1   Rohan   25   blr
2  Rajesh   30  pune


- To add a new column :
    - first check how many rows the dataframe has
    - for example, the dataframe above has 3 rows
    - so the new column must also have 3 values
    - suppose we want to create a new column called 'Salary'
    - make a list with 3 values
    - e.g. salary = [100000,200000,300000]
    - if the list has fewer or more values than the number of rows, pandas raises a ValueError

**syntax** : dataframe[column] = list of values
- df['Salary'] = salary (the cell below names the column 'Sal')

In [10]:
salary = [100000,200000,300000]
df['Sal'] = salary

In [11]:
df

     Name  Age  City     Sal
0   Rahul   20   hyd  100000
1   Rohan   25   blr  200000
2  Rajesh   30  pune  300000
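
The length rule from Step 5 can be checked directly. The sketch below deliberately assigns a list that is too short and catches the resulting error; `df_demo` and `too_short` are names invented for this illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Rahul', 'Rohan', 'Rajesh'],
                        'Age': [20, 25, 30]})

# Only 2 values for a dataframe with 3 rows
too_short = [100000, 200000]
try:
    df_demo['Salary'] = too_short
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# A simple guard: compare lengths before assigning
salary = [100000, 200000, 300000]
if len(salary) == len(df_demo):
    df_demo['Salary'] = salary
print(df_demo.columns.tolist())
```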


**Use Case 1**

- Create an empty dataframe first
- create 3 lists using list comprehension
- list1 = values from 1 to 10
- list2 = square of 1 to 10
- list3 = cube of 1 to 10
- Finally make a dataframe

In [2]:
df = pd.DataFrame()
list1 = [i for i in range(1,11)]
list2 = [i**2 for i in range(1,11)]
list3 = [i**3 for i in range(1,11)]

df['Normal'] = list1
df['Square'] = list2
df['Cube'] = list3

# Alternative: build the same dataframe in a single call
# (df1 is used again from Step 7 onwards)
df1 = pd.DataFrame(zip(list1,list2,list3),columns=['Normal','Square','Cube'])

df

   Normal  Square  Cube
0       1       1     1
1       2       4     8
2       3       9    27
3       4      16    64
4       5      25   125
5       6      36   216
6       7      49   343
7       8      64   512
8       9      81   729
9      10     100  1000


**Step 6**
- To create a new column
    - create a new list of values
    - assign it to df['new_column_name of our choice']
- Updating an existing column works the same way
    - the list must again have as many values as there are rows
    - use the name of the existing column instead of a new name
    - df['Z'] = list of values, where 'Z' is an existing column

In [14]:
# Overwrite the existing 'Cube' column with fourth powers
df['Cube'] = [i**4 for i in range(1,11)]
df

   Normal  Square   Cube
0       1       1      1
1       2       4     16
2       3       9     81
3       4      16    256
4       5      25    625
5       6      36   1296
6       7      49   2401
7       8      64   4096
8       9      81   6561
9      10     100  10000
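
A list is not the only thing that can be assigned to a column. As an alternative sketch (not the approach used in the cells above), a new or existing column can also be computed from another column; `df_pow` is a name chosen for this example.

```python
import pandas as pd

df_pow = pd.DataFrame({'Normal': list(range(1, 11))})

# Create new columns from an existing one
df_pow['Square'] = df_pow['Normal'] ** 2
df_pow['Cube'] = df_pow['Normal'] ** 3

# Update the existing 'Cube' column by assigning to the same name
df_pow['Cube'] = df_pow['Normal'] ** 4

print(df_pow.tail(2))
```

Because the result is derived from a column of the same dataframe, its length always matches the number of rows.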


**Step 7**

- How to drop a column
    - To drop a specific column, use the drop method
    - Recall the difference between keywords and methods:
        - keywords are used directly
        - methods are called on the corresponding dataframe
    - **dataframe_name.drop() method**
        - labels : the column(s) or row(s) to drop, passed as the first argument
        - axis : says whether the labels refer to rows or columns
            - axis = 0 means rows (default value)
            - axis = 1 means columns
        - inplace : controls whether df itself is overwritten
            - inplace = True modifies df directly
            - inplace = False (default) leaves df unchanged and returns the modified copy, so assign it to a new variable

In [20]:
df.drop(['Cube'],axis=1,inplace=True)
df

   Normal  Square
0       1       1
1       2       4
2       3       9
3       4      16
4       5      25
5       6      36
6       7      49
7       8      64
8       9      81
9      10     100


In [22]:
df.drop(['Square'],axis=1,inplace=True)
df

   Normal
0       1
1       2
2       3
3       4
4       5
5       6
6       7
7       8
8       9
9      10


In [27]:
# Here, we are not using inplace = True
df_drop = df1.drop(['Cube'],axis=1)   # So, df1 will remain as it is
df_drop

   Normal  Square
0       1       1
1       2       4
2       3       9
3       4      16
4       5      25
5       6      36
6       7      49
7       8      64
8       9      81
9      10     100


In [28]:
df1

   Normal  Square  Cube
0       1       1     1
1       2       4     8
2       3       9    27
3       4      16    64
4       5      25   125
5       6      36   216
6       7      49   343
7       8      64   512
8       9      81   729
9      10     100  1000
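
pandas also accepts a `columns=` keyword in `drop`, which avoids having to remember the axis number. The sketch below compares both spellings on a small made-up dataframe (`df_src` is an assumed name).

```python
import pandas as pd

df_src = pd.DataFrame({'Normal': [1, 2, 3],
                       'Square': [1, 4, 9],
                       'Cube': [1, 8, 27]})

# Both calls return a new dataframe without 'Cube'
without_cube = df_src.drop(columns=['Cube'])
same_result = df_src.drop(['Cube'], axis=1)

print(without_cube.columns.tolist())  # ['Normal', 'Square']
print(df_src.columns.tolist())        # original is unchanged
```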


In [36]:
# one way of replacing items using 'loc'
df1.loc[3,'Square'] = 1600

In [38]:
# another way of replacing cell values is to use the 'dataframe.replace' method
df1.replace(to_replace=1600,value=16,inplace=True)

In [40]:
# replace() matches values anywhere in the dataframe, so every occurrence is replaced
# the two lists are paired up: 1 -> 0.12, 4 -> -2.34, 9 -> -32.13, 8 -> 19.65
df1.replace(to_replace=[1,4,9,8],value=[0.12,-2.34,-32.13,19.65],inplace=True)

In [41]:
df1

   Normal  Square    Cube
0    0.12    0.12    0.12
1     2.0   -2.34   19.65
2     3.0  -32.13    27.0
3   -2.34    16.0    64.0
4     5.0    25.0   125.0
5     6.0    36.0   216.0
6     7.0    49.0   343.0
7   19.65    64.0   512.0
8  -32.13    81.0   729.0
9    10.0   100.0  1000.0
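
Since `replace` on a whole dataframe touches every column, it can change more cells than intended, as the table above shows. One possible way to limit the change to a single column is to call `replace` on that column only. This is a sketch on made-up data; `df_rep` is an assumed name.

```python
import pandas as pd

df_rep = pd.DataFrame({'Normal': [1, 2, 3, 4],
                       'Square': [1, 4, 9, 16]})

# Dataframe-wide: the value 4 is replaced in both columns
everywhere = df_rep.replace(to_replace=4, value=-1)

# Column-only: replace within 'Square' and assign the result back
only_square = df_rep.copy()
only_square['Square'] = only_square['Square'].replace(4, -1)

print(everywhere)
print(only_square)
```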


**Step 8 Drop the row**

- Dropping a column and dropping a row work the same way
- the only difference is the axis
    - axis = 0 represents rows
    - axis = 0 is the default
    - so if we pass labels without an axis, they are treated as row labels

In [3]:
# Running this raises a KeyError:
# by default axis = 0, so drop() looks for 'Normal' among the row labels,
# but 'Normal' is a column
df.drop(['Normal'])

KeyError: "['Normal'] not found in axis"

In [4]:
df

   Normal  Square  Cube
0       1       1     1
1       2       4     8
2       3       9    27
3       4      16    64
4       5      25   125
5       6      36   216
6       7      49   343
7       8      64   512
8       9      81   729
9      10     100  1000


In [6]:
# To drop rows, pass the row index labels
# Always put multiple rows or multiple columns in a list
df1 = df.drop(index=[3,6])

In [11]:
df1.columns[[0,1]]

Index(['Normal', 'Square'], dtype='object')
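
After rows are dropped, the remaining rows keep their original index labels, so the index has gaps (3 and 6 are missing in `df1`). If a continuous index is wanted, `reset_index(drop=True)` is one option. A short sketch on a made-up dataframe (`df_rows` is an assumed name):

```python
import pandas as pd

df_rows = pd.DataFrame({'Normal': list(range(1, 11))})

# Drop the rows labelled 3 and 6; the other labels are kept
dropped = df_rows.drop(index=[3, 6])
print(dropped.index.tolist())     # [0, 1, 2, 4, 5, 7, 8, 9]

# Renumber from 0 and discard the old labels
renumbered = dropped.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
```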

**Step 9 How to select columns**

In [17]:
print(df1['Normal'])
print(type(df1['Normal']))

# Single brackets return a Series
# A Series is printed without a column header;
# the column name appears in the 'Name:' line at the bottom

0     1
1     2
2     3
4     5
5     6
7     8
8     9
9    10
Name: Normal, dtype: int64
<class 'pandas.core.series.Series'>


In [16]:
print(df1[['Normal']])
print(type(df1[['Normal']]))

# Dataframe

   Normal
0       1
1       2
2       3
4       5
5       6
7       8
8       9
9      10
<class 'pandas.core.frame.DataFrame'>


In [18]:
# Another way to print Series
df1.Normal

0     1
1     2
2     3
4     5
5     6
7     8
8     9
9    10
Name: Normal, dtype: int64

- df1['Normal'] : Series type
- df1.Normal : Series type
- df1[['Normal']] : Dataframe type

In [19]:
df1[['Square','Cube']]

   Square  Cube
0       1     1
1       4     8
2       9    27
4      25   125
5      36   216
7      64   512
8      81   729
9     100  1000
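
The difference between single and double brackets can also be checked in code rather than read off the printed output. A minimal sketch on a small made-up dataframe (`df_sel` is an assumed name):

```python
import pandas as pd

df_sel = pd.DataFrame({'Normal': [1, 2, 3], 'Square': [1, 4, 9]})

as_series = df_sel['Normal']    # single brackets: Series
as_frame = df_sel[['Normal']]   # double brackets: DataFrame

print(isinstance(as_series, pd.Series))    # True
print(isinstance(as_frame, pd.DataFrame))  # True
print(as_series.shape)  # (3,)   one dimension
print(as_frame.shape)   # (3, 1) rows and columns
```

Attribute access such as `df_sel.Normal` works only when the column name is a valid Python identifier and does not clash with an existing dataframe attribute, so bracket access is the safer general form.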


In [20]:
df1['Square'].values
# Here type is array

array([  1,   4,   9,  25,  36,  64,  81, 100], dtype=int64)

- .values returns all the values of the column as an array

- An array is a sequence of elements, like a list, provided by the NumPy package

- Generally, collections of elements are stored in one of these ways:
    - List         : basic Python representation
    - NumPy array  : array from the NumPy package
    - Tensor       : TensorFlow package
    - Torch tensor : PyTorch package
  

In [21]:
list1 = [1,2,3,4]
list1

[1, 2, 3, 4]

In [25]:
import numpy as np
arr1 = np.array(list1)
print(arr1)

[1 2 3 4]


In [24]:
list2 = [10,20,30,40]
arr2 = np.array(list2)
print(arr2)

[10 20 30 40]


In [28]:
# Concatenation in Lists
print(list1 + list2)

[1, 2, 3, 4, 10, 20, 30, 40]


In [27]:
# Element-wise addition with arrays
print(arr1 + arr2)

[11 22 33 44]
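
The same contrast shows up with multiplication, and it is easy to move between the three representations. The sketch below uses `Series.to_numpy()`, the method the pandas documentation recommends over `.values`; the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 4, 9])

arr = s.to_numpy()          # Series -> NumPy array
back_to_list = arr.tolist() # NumPy array -> plain Python list

doubled = arr * 2            # element-wise: each value is doubled
repeated = back_to_list * 2  # list repetition: the list is repeated

print(doubled)
print(repeated)
```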


In [29]:
df1['Cube2'] = df1['Cube'].values

# Suppose an ML model gives us an array of values (for example, predictions).
# Such an array can be added as a column or used to create a dataframe.

In [30]:
df1

   Normal  Square  Cube  Cube2
0       1       1     1      1
1       2       4     8      8
2       3       9    27     27
4       5      25   125    125
5       6      36   216    216
7       8      64   512    512
8       9      81   729    729
9      10     100  1000   1000
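
To make the ML remark in the cell above concrete, here is a sketch that turns an array of values into a dataframe. The numbers are invented stand-ins for model output, and the 0.5 threshold is an arbitrary choice for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical model output (made-up values)
predictions = np.array([0.1, 0.7, 0.4])

# Build a dataframe directly from the array
df_pred = pd.DataFrame({'Prediction': predictions})

# Derive a second column from the first
df_pred['Label'] = (df_pred['Prediction'] > 0.5).astype(int)

print(df_pred)
```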


**Step 10 How to save the dataframe**
- We can save dataframes in 2 formats
    - csv : comma separated values
    - xlsx : Excel format
- To save any data we need to decide
    - where to save it : the directory name
    - the name of the file
    - the file extension
- We have already seen this in the file handling session

In [None]:
# Case-1 : saving the file in the current working directory
#   No need to provide a directory path
#   Give only the file name and extension

# Case-2 : saving the file at another location
#   Provide the path, then the file name

In [33]:
df1.to_csv('Data1.csv')
df1.to_excel('Data1.xlsx')
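
Case-2 (saving at another location) is not shown in the cells above. The sketch below writes a CSV into a temporary directory so that it can run anywhere; in practice the directory would be a real folder of your choice. `df_save` and `target` are names assumed for this example.

```python
import tempfile
from pathlib import Path

import pandas as pd

df_save = pd.DataFrame({'Normal': [1, 2], 'Square': [1, 4]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # directory + file name + extension
    target = Path(tmp_dir) / 'Data1.csv'
    df_save.to_csv(target)

    file_exists = target.exists()
    first_line = target.read_text().splitlines()[0]

print(file_exists)
print(first_line)  # header row; the leading comma is the unnamed index column
```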

**Step 11 How to read a dataframe**

In [34]:
df1_csv = pd.read_csv('Data1.csv')
df1_xlsx = pd.read_excel('Data1.xlsx')

In [35]:
df1_csv

   Unnamed: 0  Normal  Square  Cube  Cube2
0           0       1       1     1      1
1           1       2       4     8      8
2           2       3       9    27     27
3           4       5      25   125    125
4           5       6      36   216    216
5           7       8      64   512    512
6           8       9      81   729    729
7           9      10     100  1000   1000


In [36]:
df1_xlsx

   Unnamed: 0  Normal  Square  Cube  Cube2
0           0       1       1     1      1
1           1       2       4     8      8
2           2       3       9    27     27
3           4       5      25   125    125
4           5       6      36   216    216
5           7       8      64   512    512
6           8       9      81   729    729
7           9      10     100  1000   1000


In [37]:
pwd()

"C:\\Users\\thaku\\Documents\\MacBook\\Shubham's Stuff\\Naresh IT\\Data_Science_AI_Omkar Sir\\EDA Lectures\\Shubham"

**Step 12 How to avoid the extra row index column while saving a file**

- By default, the row index is written to the file as an extra column
- Pass index=False to avoid this
- A common error is "Permission denied", which usually means the file is open in another program; close the file and save again

In [39]:
df1.to_csv('Data1.csv',index=False)
df1.to_excel('Data1.xlsx',index=False)

In [42]:
df1_csv = pd.read_csv('Data1.csv')
df1_xlsx = pd.read_excel('Data1.xlsx')

In [41]:
df1_csv

   Normal  Square  Cube  Cube2
0       1       1     1      1
1       2       4     8      8
2       3       9    27     27
3       5      25   125    125
4       6      36   216    216
5       8      64   512    512
6       9      81   729    729
7      10     100  1000   1000


In [43]:
df1_xlsx

   Normal  Square  Cube  Cube2
0       1       1     1      1
1       2       4     8      8
2       3       9    27     27
3       5      25   125    125
4       6      36   216    216
5       8      64   512    512
6       9      81   729    729
7      10     100  1000   1000
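
If the index was already written to the file, or the original row labels need to be kept, an alternative to `index=False` is to read the first column back as the index with `index_col=0`. The sketch below works on an in-memory CSV string so that no file is needed; `df_idx` is a made-up example.

```python
import io

import pandas as pd

df_idx = pd.DataFrame({'Normal': [1, 2, 5]}, index=[0, 1, 4])

# to_csv() without a path returns the CSV text (index included)
csv_text = df_idx.to_csv()

# Default read: the saved index comes back as an 'Unnamed: 0' column
plain = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the row index again
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())
print(restored.index.tolist())
```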
