# <font color='yellow'> Pandas Introduction

## <font color='orange'> What is Pandas?

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

## <font color='Orange'> Why Use Pandas?

Pandas lets us analyze large data sets and draw conclusions based on statistical theory.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

If a library is not installed, install it with pip (Python's package installer). In a notebook cell, run:

**!pip install pandas** or **!pip install numpy**

In [None]:
import pandas as pd
import numpy as np

## <font color='orange'>  Pandas Series

A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

## <font color='orange'> Labels

If nothing else is specified, the values are labeled with their index number.

**First value has index 0, second value has index 1 etc**.

This label can be used to access a specified value.

Return the first value of the Series: **print(myvar[0])**

### <font color='pink'> Create Labels

With the index argument, you can name your own labels.
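The two labelling behaviours described above can be sketched as follows. This is a minimal illustration; the names default_labels and custom_labels are chosen here and do not appear elsewhere in the notebook.

```python
import pandas as pd

values = [1, 7, 2]

# Without an index argument, the labels are the positions 0, 1, 2.
default_labels = pd.Series(values)
first_value = default_labels[0]

# With an index argument, each value gets the label supplied for it.
custom_labels = pd.Series(values, index=["x", "y", "z"])
value_at_y = custom_labels["y"]

print(first_value, value_at_y)
```

Once custom labels are set, a value is looked up by its label in the same way it was looked up by its number before.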

## <font color='orange'> Key Objects as Series

You can also use a key/value object, like a dictionary, when creating a Series.

The cells below create Series from a list, a tuple, a dictionary, and a NumPy array:

In [None]:
#Creating a series

a = [1, 7, 2]

myvar = pd.Series(a)

myvar1 = pd.Series(a, index = ["x", "y", "z"])

print(myvar)
print(myvar1)

In [None]:
#using tuple
t1=(1,2,3,4)
t1S=pd.Series(t1,index=["A","B","C","D"])
print(t1S)

In [None]:
#using list
L1=[1,2,3,4]
L1S=pd.Series(L1,index=["A","B","C","D"])
print(L1S)

In [None]:
#using dict
d1={"x":"asdf",1:12,'1':"mnb"}
d1S=pd.Series(d1)
print(d1S)

In [None]:
list(range(1,12))

In [None]:
np.arange(1,10,1/7)#,dtype=float)

In [None]:
#using numpy
a1=np.array([1,2,3,4]) 
a1S=pd.Series(a1)
print(a1S)

s11=np.round(np.arange(1,10,1/7,dtype=float),3)
s1=pd.Series(s11)
print(s1)

In [None]:
print(t1S+2)
print(d1S*3)
print(s1**4)
print(np.round(s1/7,2))
print(L1S**L1S)

## <font color='orange'> Named Indexes in Series

With the index argument, you can name your own indexes.

Add a list of names to give each row a name:

### <font color='pink'> Locate Named Indexes *loc* attribute

Use the named index in the loc attribute to return the specified row(s).

Return "day2":

In [None]:
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)

In [None]:
type(myvar)

In [None]:
myvar.loc['day2']

## <font color='Orange'> DataFrames

Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array, or a table with rows and columns.

A Series is like a column; a DataFrame is the whole table.

Create a DataFrame from a dictionary of two lists:

In [None]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
EX = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(EX)

In [None]:
print(pd.DataFrame(data))
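Since a Series is like a column, the same table can also be assembled from two Series objects. The sketch below assumes both Series carry the same labels; the name workout is illustrative only.

```python
import pandas as pd

# Two Series that share the same labels.
calories = pd.Series([420, 380, 390], index=["day1", "day2", "day3"])
duration = pd.Series([50, 40, 45], index=["day1", "day2", "day3"])

# Each dictionary key becomes a column name; rows are aligned on the labels.
workout = pd.DataFrame({"calories": calories, "duration": duration})
print(workout)

# Selecting one column gives back a Series.
print(type(workout["calories"]))
```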

## <font color='orange'> Dictionary to DF

In [None]:
# Making dataframe from a dictionary
df2 = pd.DataFrame({'A': 1.,
                        'B': pd.Timestamp('20130102'),
                        'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                        'D': np.array([3] * 4, dtype='int32'),
                        'E': pd.Categorical(["test", "train", "test", "train"]),
                         'F': 'foo'},index=(1,2,3,4))
print(df2)

## <font color='orange'> List to DF

In [None]:
# list of strings
lst = ['alpha', 'beta', 'gamma', 'delta', '0', '1', '2']

# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)

## <font color='orange'> Read / Load a CSV from the local drive

In [27]:
df = pd.read_csv(r"E:\ANALYTICS TRAINING\Data\Wage.csv") # r prefix: raw string literal, so backslashes are taken literally

In [None]:
df.loc[0:12, ['wage','age']]

In [None]:
ss=df.iloc[0:6,[3,5,10,0]]
print(ss)
print(type(ss))

## <font color='orange'> Data Information

The DataFrame object has a method called info() that gives you more information about the data set.

In [None]:
print(df)

In [28]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ID          3000 non-null   int64  
 1   year        3000 non-null   int64  
 2   age         3000 non-null   int64  
 3   maritl      3000 non-null   object 
 4   race        3000 non-null   object 
 5   education   3000 non-null   object 
 6   region      3000 non-null   object 
 7   jobclass    3000 non-null   object 
 8   health      3000 non-null   object 
 9   health_ins  3000 non-null   object 
 10  logwage     3000 non-null   float64
 11  wage        3000 non-null   float64
 12  xx          3000 non-null   int64  
 13  yy          3000 non-null   bool   
dtypes: bool(1), float64(2), int64(4), object(7)
memory usage: 307.7+ KB
None


In [29]:
print(df.dtypes)

ID              int64
year            int64
age             int64
maritl         object
race           object
education      object
region         object
jobclass       object
health         object
health_ins     object
logwage       float64
wage          float64
xx              int64
yy               bool
dtype: object


In [30]:
df2.dtypes

A          float64
B    datetime64[s]
C          float32
D            int32
E         category
F           object
dtype: object

### <font color='pink'>Change the data type for a single / multiple variables

In [33]:
df['maritl'] = df['maritl'].astype('object')

df[['race','education']] = df[['race','education']].astype('object')

df[['region','jobclass','health','health_ins']] = df[['region','jobclass','health','health_ins']].astype('object')

df['year']=df['year'].astype('category')

df['ID']=df['ID'].astype('string')

In [34]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   ID          3000 non-null   string  
 1   year        3000 non-null   category
 2   age         3000 non-null   int64   
 3   maritl      3000 non-null   object  
 4   race        3000 non-null   object  
 5   education   3000 non-null   object  
 6   region      3000 non-null   object  
 7   jobclass    3000 non-null   object  
 8   health      3000 non-null   object  
 9   health_ins  3000 non-null   object  
 10  logwage     3000 non-null   float64 
 11  wage        3000 non-null   float64 
 12  xx          3000 non-null   int64   
 13  yy          3000 non-null   bool    
dtypes: bool(1), category(1), float64(2), int64(2), object(7), string(1)
memory usage: 287.6+ KB
None


In [None]:
print(df.dtypes)
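Several columns can also be converted in a single *astype* call by passing a dictionary that maps column names to types. The sketch below uses a small hypothetical frame (small) in place of the wage data.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the wage data.
small = pd.DataFrame({
    "ID": [231655, 86582, 161300],
    "year": [2006, 2004, 2003],
    "age": [18, 24, 45],
})

# A dictionary converts several columns in one call;
# columns not named in it (here "age") keep their dtype.
converted = small.astype({"ID": "string", "year": "category"})
print(converted.dtypes)
```

*astype* returns a new DataFrame, so the result has to be assigned to a name for the change to be kept.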

In [35]:
print(df.shape)
print(df.ndim)
print(df.size)

(3000, 14)
2
42000


In [36]:
print(df.head(5))

       ID  year  age            maritl      race        education  \
0  231655  2006   18  1. Never Married  1. White     1. < HS Grad   
1   86582  2004   24  1. Never Married  1. White  4. College Grad   
2  161300  2003   45        2. Married  1. White  3. Some College   
3  155159  2003   43        2. Married  3. Asian  4. College Grad   
4   11443  2005   50       4. Divorced  1. White       2. HS Grad   

               region        jobclass          health health_ins   logwage  \
0  2. Middle Atlantic   1. Industrial       1. <=Good      2. No  4.318063   
1  2. Middle Atlantic  2. Information  2. >=Very Good      2. No  4.255273   
2  2. Middle Atlantic   1. Industrial       1. <=Good     1. Yes  4.875061   
3  2. Middle Atlantic  2. Information  2. >=Very Good     1. Yes  5.041393   
4  2. Middle Atlantic  2. Information       1. <=Good     1. Yes  4.318063   

         wage  xx     yy  
0   75.043154   0  False  
1   70.476020   0  False  
2  130.982177   1   True  
3  154.6

In [38]:
print("last five rows (default)")
print(df.tail())
print("last two rows")
df.tail(2)

last five rows (default)
          ID  year  age            maritl      race        education  \
2995  376816  2008   44        2. Married  1. White  3. Some College   
2996  302281  2007   30        2. Married  1. White       2. HS Grad   
2997   10033  2005   27        2. Married  2. Black     1. < HS Grad   
2998   14375  2005   27  1. Never Married  1. White  3. Some College   
2999  453557  2009   55      5. Separated  1. White       2. HS Grad   

                  region       jobclass          health health_ins   logwage  \
2995  2. Middle Atlantic  1. Industrial  2. >=Very Good     1. Yes  5.041393   
2996  2. Middle Atlantic  1. Industrial  2. >=Very Good      2. No  4.602060   
2997  2. Middle Atlantic  1. Industrial       1. <=Good      2. No  4.193125   
2998  2. Middle Atlantic  1. Industrial  2. >=Very Good     1. Yes  4.477121   
2999  2. Middle Atlantic  1. Industrial       1. <=Good     1. Yes  4.505150   

            wage  xx     yy  
2995  154.685293   1   True  
2

Unnamed: 0,ID,year,age,maritl,race,education,region,jobclass,health,health_ins,logwage,wage,xx,yy
2998,14375,2005,27,1. Never Married,1. White,3. Some College,2. Middle Atlantic,1. Industrial,2. >=Very Good,1. Yes,4.477121,87.981033,1,True
2999,453557,2009,55,5. Separated,1. White,2. HS Grad,2. Middle Atlantic,1. Industrial,1. <=Good,1. Yes,4.50515,90.481913,1,True


In [39]:
print(df.columns)    # for reading the headers

Index(['ID', 'year', 'age', 'maritl', 'race', 'education', 'region',
       'jobclass', 'health', 'health_ins', 'logwage', 'wage', 'xx', 'yy'],
      dtype='object')


In [40]:
df['race'].unique() #Getting the levels of a categorical data (object)

array(['1. White', '3. Asian', '4. Other', '2. Black'], dtype=object)

In [42]:
df.education.value_counts() #counts for levels of a categorical data

education
2. HS Grad            971
4. College Grad       685
3. Some College       650
5. Advanced Degree    426
1. < HS Grad          268
Name: count, dtype: int64

### <font color='pink'> changing value of a cell using iat or at

In [48]:
df.head(2)

Unnamed: 0,ID,year,age,maritl,race,education,region,jobclass,health,health_ins,logwage,wage,xx,yy
0,231655,2006,100,1. Never Married,1. White,1. < HS Grad,2. Middle Atlantic,1. Industrial,1. <=Good,2. No,4.318063,75.043154,0,False
1,86582,2004,24,1. Never Married,1. White,4. College Grad,2. Middle Atlantic,2. Information,2. >=Very Good,2. No,4.255273,70.47602,0,False


In [46]:
print(df.iat[0,2])
df.iat[0,2]=555
print(df.iat[0,2])

18
555


In [47]:
print(df.at[0,'age']) 
df.at[0,'age']  = 100
print(df.at[0,'age']) 
df.at[2, 'age'] = 200
print(df.at[2,'age']) 

555
100
200
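In the wage data the row labels happen to equal the row positions, so *iat* and *at* look alike. A small hypothetical frame with string labels (marks, invented for this sketch) makes the difference visible.

```python
import pandas as pd

# Hypothetical frame with string row labels so that labels and positions differ.
marks = pd.DataFrame({"SNo": [101, 102, 103], "Mark": [45, 84, 63]},
                     index=["S1", "S2", "S3"])

# iat addresses a single cell by integer position (row 0, column 1).
marks.iat[0, 1] = 50

# at addresses a single cell by row label and column name.
marks.at["S3", "Mark"] = 70

print(marks)
```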


## <font color='Yellow'> DATA HANDLING

## <font color='orange'> Subsetting and Transformations

### <font color='pink'> Column Name and / or Indices

In [None]:
print(df.age) #Extract the column "age"

In [None]:
df['age']

In [None]:
print(df['age'][0:6])    # give the column name for printing all the elements from the chosen rows

In [None]:
print(df[['age', 'maritl', 'race']])  # giving multiple column names inside lists - two square brackets

In [None]:
df.race.value_counts() #Count the number of items in each level of a categorical variable

In [None]:
df[['health','education','race']].value_counts().sort_index()

In [None]:
df[['education','race','health']].value_counts().sort_index()

In [None]:
df[['health','education','race']].value_counts().sort_values(ascending=False)

In [None]:
df.dtypes

## <font color='orange'> Locate Row

A DataFrame is a 2-dimensional table with rows and columns.

Pandas uses the **loc** attribute to return one or more specified row(s).

In [50]:
print(df.loc[0]) #Return row 0
print("output2")
print(df.loc[[0, 1]]) #Return rows 0 and 1
print("output3")
print(df.loc[0:5]) #Return rows 0 to 5 (both inclusive)

ID                        231655
year                        2006
age                          100
maritl          1. Never Married
race                    1. White
education           1. < HS Grad
region        2. Middle Atlantic
jobclass           1. Industrial
health                 1. <=Good
health_ins                 2. No
logwage                 4.318063
wage                   75.043154
xx                             0
yy                         False
Name: 0, dtype: object
output2
       ID  year  age            maritl      race        education  \
0  231655  2006  100  1. Never Married  1. White     1. < HS Grad   
1   86582  2004   24  1. Never Married  1. White  4. College Grad   

               region        jobclass          health health_ins   logwage  \
0  2. Middle Atlantic   1. Industrial       1. <=Good      2. No  4.318063   
1  2. Middle Atlantic  2. Information  2. >=Very Good      2. No  4.255273   

        wage  xx     yy  
0  75.043154   0  False  
1  70.476020

## <font color='orange'> *loc* and column names

In [51]:
singcol=df.loc[0:5]['age']
print(singcol)
print(type(singcol))

0    100
1     24
2    200
3     43
4     50
5     54
Name: age, dtype: int64
<class 'pandas.core.series.Series'>


In [52]:
df.loc[0:5][['age','race']] #Multiple columns

   age      race
0  100  1. White
1   24  1. White
2  200  1. White
3   43  3. Asian
4   50  1. White
5   54  1. White


In [53]:
df[['age','race']][0:5]

   age      race
0  100  1. White
1   24  1. White
2  200  1. White
3   43  3. Asian
4   50  1. White


## <font color='orange'> Usage of *iloc*

*iloc* is used to access data by row and column indices as **integers**

Syntax: df.iloc[row_index, column_index]

The row and column indices must be integers

In [54]:
print(df.iloc[2:6])  # print elements from specified range

       ID  year  age       maritl      race        education  \
2  161300  2003  200   2. Married  1. White  3. Some College   
3  155159  2003   43   2. Married  3. Asian  4. College Grad   
4   11443  2005   50  4. Divorced  1. White       2. HS Grad   
5  376662  2008   54   2. Married  1. White  4. College Grad   

               region        jobclass          health health_ins   logwage  \
2  2. Middle Atlantic   1. Industrial       1. <=Good     1. Yes  4.875061   
3  2. Middle Atlantic  2. Information  2. >=Very Good     1. Yes  5.041393   
4  2. Middle Atlantic  2. Information       1. <=Good     1. Yes  4.318063   
5  2. Middle Atlantic  2. Information  2. >=Very Good     1. Yes  4.845098   

         wage  xx     yy  
2  130.982177   1   True  
3  154.685293   1   True  
4   75.043154   0  False  
5  127.115744   1   True  


In [55]:
print(df.iloc[2:6, 1]) # Extract the required index / column

print(df.iloc[2:6,1:3])

2    2003
3    2003
4    2005
5    2008
Name: year, dtype: category
Categories (7, string): [2003, 2004, 2005, 2006, 2007, 2008, 2009]
   year  age
2  2003  200
3  2003   43
4  2005   50
5  2008   54


In [59]:
print(df.iloc[1, 1])    # Extract 2nd row and 2nd column - passing row and column positions (counted from 0) as arguments

2004


In [58]:
df.head()

Unnamed: 0,ID,year,age,maritl,race,education,region,jobclass,health,health_ins,logwage,wage,xx,yy
0,231655,2006,100,1. Never Married,1. White,1. < HS Grad,2. Middle Atlantic,1. Industrial,1. <=Good,2. No,4.318063,75.043154,0,False
1,86582,2004,24,1. Never Married,1. White,4. College Grad,2. Middle Atlantic,2. Information,2. >=Very Good,2. No,4.255273,70.47602,0,False
2,161300,2003,200,2. Married,1. White,3. Some College,2. Middle Atlantic,1. Industrial,1. <=Good,1. Yes,4.875061,130.982177,1,True
3,155159,2003,43,2. Married,3. Asian,4. College Grad,2. Middle Atlantic,2. Information,2. >=Very Good,1. Yes,5.041393,154.685293,1,True
4,11443,2005,50,4. Divorced,1. White,2. HS Grad,2. Middle Atlantic,2. Information,1. <=Good,1. Yes,4.318063,75.043154,0,False


## <font color='orange'> Difference between iloc and loc

In [61]:
eg_df = pd.DataFrame({
    'SNo': [101, 102, 103],
    'Mark': [45, 84, 63]}, 
    index=['S1', 'S2', 'S3']) 
eg_df

    SNo  Mark
S1  101    45
S2  102    84
S3  103    63


In [62]:
eg_df.iloc[1, 1] #Extract 2nd row and 2nd column - indices are integers

np.int64(84)

In [63]:
eg_df.loc["S2", "Mark"] #Labels are used to extract 

np.int64(84)
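A second difference shows up when slicing: *iloc* leaves out the end position, while *loc* includes the end label (as in the earlier "rows 0 to 5, both inclusive" example). A minimal sketch using the same eg_df values:

```python
import pandas as pd

eg_df = pd.DataFrame({"SNo": [101, 102, 103], "Mark": [45, 84, 63]},
                     index=["S1", "S2", "S3"])

# iloc slices by position and, like Python lists, leaves out the end position.
by_position = eg_df.iloc[0:2]

# loc slices by label and includes the end label.
by_label = eg_df.loc["S1":"S2"]

# Both selections contain the rows S1 and S2.
print(by_position.equals(by_label))
```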

## <font color='orange'> Selected Rows, all Columns OR Selected Columns, all Rows                        

In [None]:
print(df.iloc[2:6,:])

In [None]:
print(df.iloc[ :,0:2])

In [None]:
for index, row in df.iterrows():            #Iterate over DataFrame rows
  print(index, row['age'])   # or just put print(index, row)

## <font color='orange'> Filter (subset) using *loc*

In [None]:
df.loc[df['race']=='1. White']

In [None]:
df.loc[(df['race']=='1. White')| (df['race']=='2. Black')] # Usage of or logic - "|"

In [None]:
df.loc[(df['race']=='1. White')& (df['age']%3!=0)] # Usage of and logic - "&"

In [None]:
df.loc[(df['race']=='1. White')& (df['age']%2==0)]

In [None]:
df.loc[((df['race']=='1. White')|(df['race']=='2. Black')) & (df['age']%20==0)]

## <font color='orange'> *loc* within a column

Filtering can be done on a single or multiple columns conditioned on other columns

In [None]:
ex1=df.loc[(df['age']%8==0)&(df['age']>24)&(df['wage']<1000)] #Whole data set

print(ex1)

In [None]:
ex2=df['race'].loc[(df['age']%2==0)&(df['age']>10)&(df['wage']<1000)] #Only one column

print(ex2)
print(ex2.unique())

In [None]:
df[['race','maritl']].loc[(df['wage']<1000)&(df['age']>25)]

In [None]:
df.loc[df['age']%28==0] #Rows where age is a multiple of 28
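Row filtering and column selection can also be done in one *loc* call by passing the condition first and the column list second. The sketch below uses a hypothetical five-row frame (people) with made-up values rather than the wage file.

```python
import pandas as pd

# Hypothetical miniature of the wage data; the values are made up for illustration.
people = pd.DataFrame({
    "age": [18, 24, 45, 43, 50],
    "race": ["1. White", "1. White", "1. White", "3. Asian", "1. White"],
    "wage": [75.0, 70.5, 131.0, 154.7, 75.0],
})

# Each comparison needs its own parentheses because & and | bind more tightly than ==, > or <.
mask = (people["race"] == "1. White") & (people["age"] > 20)

# Passing the mask and a column list to loc filters rows and picks columns in one step.
subset = people.loc[mask, ["age", "wage"]]
print(subset)
```

Storing the condition in a variable such as mask also makes it easy to reuse the same filter on several columns.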

## <font color='orange'>Sort a DF based on a column

In [None]:
df.sort_values('race', ascending = False) #default: True

### <font color='pink'> sorting by one or more columns

In [None]:
df_sort1=df.sort_values(by=['race'])
df_sort1.head()

In [None]:
df_sort2=df.sort_values(by=['race','age'])
df_sort2.head()

In [None]:
df_sort3=df.sort_values(by=['age','race'], ascending = False)
df_sort3.head(20)

In [None]:
df_sort3=df.sort_values(by=['age','wage'], ascending = [True,False])
df_sort3.head(20)
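The effect of a list in *ascending* is easier to see on a few rows. In this sketch the frame staff is hypothetical; two people share each age so that the second sort key decides their order.

```python
import pandas as pd

# Hypothetical rows; two people share the same age so the second key matters.
staff = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age": [30, 25, 30, 25],
    "wage": [90.0, 60.0, 120.0, 80.0],
})

# age ascending, then wage descending within each age.
ordered = staff.sort_values(by=["age", "wage"], ascending=[True, False])
print(ordered)

# sort_values returns a new DataFrame; the original row order is unchanged.
print(staff["name"].tolist())
```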

## <font color='orange'> Creating a new column

### <font color='pink'> Addition of existing two numeric columns

In [None]:
df['Total'] = df['wage'] + df['logwage']
print(df['Total'])

In [None]:
#Alternately, use column index
df1=df.iloc[:, 10]+df.iloc[:, 11]
print(df1)

In [None]:
#or
df2=df['Total'] = df.iloc[:, 10:12].sum(axis=1)
print(df2)

### <font color='pink'> Using lambda function

In [None]:
file1 = pd.DataFrame({
    'name': ['Amba','Baskar','Charlie','Darwin'],
    'age': [25,66,56,78]
})

In [None]:
file1.assign(
    is_senior = lambda dataframe: dataframe['age'].map(lambda x: "Senior" if x >= 65 else "NonSenior"),
    is_senior1 = lambda dataframe: dataframe['age'].map(lambda x: True if x >= 65 else False)
)

In [None]:
file1.assign(
    is_senior = lambda dataframe: dataframe['age'].map(lambda age: True if age >= 65 else False) ,
    ).assign(
    name_uppercase = lambda dataframe: dataframe['name'].map(lambda name: name.upper()),
    ).assign(
    name_uppercase_double = lambda dataframe: dataframe['name_uppercase'].map(lambda name: name.upper()+"-"+name.upper()))

In [None]:
#Generate a subset of a DF  
rr=np.random.choice(np.arange(100, 226), size=5, replace=False) #randomly generate numbers without replacement
print(rr)
df_1=df[df.index.isin(rr)]
print(df_1)

## <font color='orange'> Named Index in a DF

In [None]:
#recall this df created earlier
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
EX = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(EX)

In [None]:
print(EX.loc["day2"])

print(EX.loc[["day2",'day3']])

## <font color='orange'> Transpose of a DF

In [None]:
credit1 = {'Income': [14.891, 106.025, 104.593, 148.924,  55.882],
        'Age': [34, 82, 71, 36, 68],
        'Gender': ["Male",  "Female",  "Male", "Female",  "Male"],
        'Balance': [333, 903, 580, 964, 331]}

credit = pd.DataFrame(credit1)

pd.options.display.max_rows = None #None prints all rows; an integer limits the number of rows printed

print(credit)

print(credit.T)

## <font color='orange'> Dropping columns / rows

In [None]:
drop_col=['Age','Gender']

cr_dropped=credit.drop(columns=drop_col,index = [0,1])

print(cr_dropped)

## <font color='orange'> Rename the columns and index (row names)

In [None]:
cr=credit.rename(columns={"Income": "In", "Age": "Ag", "Gender": "Ge","Balance":"Ba"})
print(cr.columns)

In [None]:
credit.T.rename(index={"Income": "In", "Age": "Ag", "Gender": "Ge","Balance":"Ba"})

In [None]:
credit.rename(index={0: "R0", 1: 'R1',2: 'R2', 3: 'R3',4:"R4",5:"R5"})

## <font color='orange'> Reorder the columns and index (row names)

In [None]:
new_col_order = ['Age', 'Income']

print(credit[new_col_order])

In [None]:
new_col_order = ['Gender', 'Age', 'Income','Balance']

print(credit[new_col_order])

In [None]:
new_ind_order = [0,2,3,4,1]
credit_ind_reord = credit.loc[new_ind_order]
print(credit_ind_reord)

## <font color='orange'> Limit the number of rows / columns to print

In [None]:
pd.options.display.max_rows = 2 
pd.options.display.max_columns = 2

print(credit)

print(credit.T)

## <font color='orange'> Creating a column using assign()

In [None]:
credit_1=credit.assign(b_a=(credit.Balance/credit.Age))
print(credit_1)

## <font color='orange'>  Adding rows of a DF using concat()

In [None]:
credit2 = pd.DataFrame([[58, 67,"Male",500], [37, 28,"Female",900]],columns=list(credit.columns),index=["n1","n2"])
credit3=pd.concat([credit, credit2], ignore_index=False)
print(credit3)
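The *ignore_index* argument decides what happens to the row labels of the pieces being stacked. A minimal sketch with two small hypothetical frames (first and second):

```python
import pandas as pd

first = pd.DataFrame({"Age": [34, 82], "Balance": [333, 903]})
second = pd.DataFrame({"Age": [67, 28], "Balance": [500, 900]}, index=["n1", "n2"])

# ignore_index=False (the default) keeps each frame's own row labels.
kept = pd.concat([first, second], ignore_index=False)

# ignore_index=True discards them and numbers the rows 0..n-1.
renumbered = pd.concat([first, second], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```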

## <font color='orange'>  Conditional subset

In [None]:
print(credit.loc[credit.Age < 40])

print("only one column")
print(credit.loc[credit.Age < 40,'Age'])

print("More than one column")
print(credit.loc[credit.Age < 40,['Balance',"Gender"]])

In [None]:
#Categorical data
credit[credit.Gender == 'Male']

In [None]:
#Categorical data
credit[credit['Gender'].isin(['Male'])]

In [None]:
#Categorical data
df[df['race'].isin(['1. White', '3. Asian'])]

## <font color='orange'>  Replace existing value of a column by a new value 

In [None]:
credit['Gender'] = credit['Gender'].replace({'Male': 'M', 'Female': 'F'}) #Assign the result back to the column
print(credit)

## <font color='orange'>  Changing the rows and column names - Existing DF

In [None]:
credit.index=list("ABCDE")
print(credit)

credit.index=['R1','R2','R3','R4','R5']
print(credit)


credit.columns=['happy','sad','normal','angry']
print(credit)

In [None]:
# For a new DF - default index and column 

somenum=np.random.randn(5,4)

dafr=pd.DataFrame(somenum)
print(dafr)

# For a new DF - user defined index and column 

dafr1 = pd.DataFrame(somenum, index=list("ABCDE"), columns=['happy','sad','normal','angry'])
print(dafr1)

### <font color='pink'> Writing to a csv

In [None]:
dafr1.to_csv('E:\\testEG.csv', index = False, header=True)   # column names only, no row names
dafr1.to_csv('E:\\testEG2.csv', index = True, header=True)   # row names and column names
dafr1.to_csv('E:\\testEG3.csv', index = True, header=False)  # row names only, no column names
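To check what the *index* argument changes, a file can be written and read back. The sketch below writes to a temporary folder instead of the E: drive; the file names and the frame are hypothetical.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({"happy": [1.0, 2.0], "sad": [3.0, 4.0]}, index=["A", "B"])

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical file names inside a temporary folder.
    without_index = os.path.join(folder, "no_index.csv")
    with_index = os.path.join(folder, "with_index.csv")

    frame.to_csv(without_index, index=False)  # row labels are not written
    frame.to_csv(with_index, index=True)      # row labels become the first column

    back_without = pd.read_csv(without_index)
    back_with = pd.read_csv(with_index, index_col=0)  # restore the labels

print(back_without.index.tolist())
print(back_with.index.tolist())
```

Reading a file that was written with row labels needs index_col=0; otherwise the labels come back as an ordinary first column.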

## <font color='orange'>  Missing Values - A Quick Introduction

In [None]:
df_mv=pd.read_csv(r"E:\ANALYTICS TRAINING\Data\dat_MissVal.csv")
df_mv

### <font color='pink'> Identify the location / size of MVs

In [None]:
na1=np.where(pd.isnull(df_mv))

print(na1)

for row, col in zip(na1[0], na1[1]):
    print(f"Row {row}, Column {df_mv.columns[col]}")
    
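#number of missing values
print(np.array(na1[0]).size)   # one row index per missing cell, so this is the count

print(np.array(na1).size)      # row and column indices together, so twice the count

print(len(na1))                # always 2: one array of row indices and one of column indices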

In [None]:
df_mv["A"].mean()

In [None]:
df_mv.mean(skipna=False)

In [None]:
df_mv.isna()

In [None]:
df_mv.dropna()

In [None]:
df_mv.fillna(value = 5)
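Because the missing-value file is read from a local drive, the same calls are sketched below on a small hypothetical frame (toy) so that the results can be followed by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical data with three missing cells.
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 5.0],
    "B": [np.nan, 5.0, 6.0, 7.0],
    "C": [7.0, 8.0, np.nan, 9.0],
})

missing_per_column = toy.isna().sum()        # count of NaN in each column
total_missing = int(toy.isna().sum().sum())  # count of NaN in the whole frame

mean_skipping = toy["A"].mean()              # NaN is skipped by default
mean_strict = toy["A"].mean(skipna=False)    # NaN propagates to the result

complete_rows = toy.dropna()                 # keeps only rows with no NaN
filled = toy.fillna(value=5)                 # every NaN replaced by 5

print(total_missing, mean_skipping, mean_strict, len(complete_rows))
```

Neither *dropna* nor *fillna* changes toy itself; each returns a new DataFrame.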

# <font color = yellow> Summary PANDAS Major Functions

## <font color = 'orange'> Import libraries **pandas as pd** and **numpy as np**

1. <font color = 'pink'>**Loading a data set in csv (as a DataFrame,df)**

1. <font color = 'pink'>**Writing csv**

1. <font color = 'pink'>**Know the structure**

      df.shape (df.size and df.ndim are optional; the dimension of rectangular data is always 2)

1. <font color = 'pink'>**Know the names of columns**

      df.columns

1. <font color = 'pink'>**To view few rows**

      * df
      * df.head()
      * df.tail()
      
        - inside ( ) type any positive integer to get as many rows

1. <font color = 'pink'>**Know the structure of DF (nature of variables)**

      df.dtypes (df.info(verbose=True) is optional)

    Major Data types

      * float64
      * int64
      * category
      * object

1. <font color = 'pink'>**Change the nature of a variable**

      df['column'] = df['column'].astype( )

1. <font color = 'pink'>**Know the presence of *NA* or *NaN***
      * pd.isna(df)
      * pd.isnull(df)
      * df.isna()
      * df.isnull()
      
      To know the row / column indices - presence of NA
      * np.where(pd.isna(df))

      Row / column indices - presence of NA as a DF
      * pd.DataFrame(list( ),index=["R","C"]).T

        *R and C are column names and T is to transpose*

1. <font color = 'pink'>**To *sort* by columns**
      
      df.sort_values(by=['C1','C2'], axis=0, ascending=True, na_position='last')

1. <font color = 'pink'>**To *sort* by rows**      
      
      df.sort_values(by=index values, axis=1, ascending=True, na_position='last'), index values are row names

1. <font color = 'pink'>**To *select* columns by the nature of a variable**
      
      df.select_dtypes(' ') - inside (' ') type the required nature (like float64)

1. <font color = 'pink'>**To *count* based on distinct value (numeric) or distinct level of a categorical variable**

      df.C.value_counts() - C is the name of the column (variable) to count

1. <font color = 'pink'>**To *replace* a value**
      
      df.replace(existing value,new value)

1. <font color = 'pink'>**To *create* a new variable**

      df['new variable'] = any suitable operations (changes can be made in a same variable, instead of *new variable* use the existing variable name)


1. <font color = 'pink'>**To *Subset* and *Conditional Subset***

1. <font color = 'pink'>**To *Rename* a column or row**

1. <font color = 'pink'>**To *reorder* columns**

1. <font color = 'pink'>**Conditionally create a column (variable)**  

#### <font color='yellowgreen'> using if, elif, else

    for i in df.index:
        if (df.loc[i,"column"] condition 1):
            df.loc[i,'new column'] = value 1
        elif ((df.loc[i,"column"] condition 2) & (df.loc[i,"column"] condition 3)):
            df.loc[i,'new column'] = value 2
        else:
            df.loc[i,'new column'] = value 3

#### <font color='yellowgreen'> Using conditions and choices and np.select

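    conditions = [(df["column"].condition 1),
                  (df["column"].condition 2 & df["column"].condition 3),
                  (df["column"].condition 4)]

    choices = [value 1, value 2, value 3]

    df["new column"] = np.select(conditions, choices)

conditions and choices must have the same length. A row that meets none of the conditions gets the default value (0 unless default= is passed).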
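A runnable version of the *np.select* pattern is sketched below. The age bands and labels are hypothetical, and the last band is supplied through the default argument instead of a third condition.

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame({"age": [12, 30, 70, 45]})

# One boolean condition per choice, checked in order; the first match wins.
conditions = [
    ages["age"] < 18,
    (ages["age"] >= 18) & (ages["age"] < 65),
]
choices = ["minor", "adult"]

# default supplies the value for rows that satisfy none of the conditions.
ages["group"] = np.select(conditions, choices, default="senior")
print(ages)
```

Giving a string default keeps the new column uniformly string, in line with the remark that the values should be all numeric or all string.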


#### <font color='yellowgreen'> Using np.where

df['new column'] = np.where((df['column'] condition 1), value 1, np.where((df['column'] condition 2) & (df['column'] condition 3), value 2, value 3))
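The nested *np.where* form can be tried on a few rows as follows. The cut-offs and labels are hypothetical and chosen only to show how the inner call handles the rows that fail the outer condition.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({"mark": [35, 60, 90, 75]})

# Hypothetical cut-offs: below 40 fails, 40 to 79 passes, 80 and above is a distinction.
scores["result"] = np.where(
    scores["mark"] < 40,
    "fail",
    np.where((scores["mark"] >= 40) & (scores["mark"] < 80), "pass", "distinction"),
)

# Count how many rows fall into each level of the new column.
print(scores["result"].value_counts())
```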

#### <font color='yellowgreen'> Remark - while  adding columns conditionally

<font color="pink"> Conditions 1, 2, 3, ... are the logical expressions (lt(10), ge(10), ==, etc.) that the existing column should satisfy

<font color="pink"> Values 1, 2, ... are the corresponding values the new column will take (all numeric or all string)

<font color="pink"> df["new column"].value_counts(sort=False) to get the counts as per the order of definition

<font color = 'pink'> **Subset / Slicing /Filtering**
      
      d1.iloc[0,2] - first row and third column

      d1.iloc[0:2,] - first two rows and all columns

      d1.iloc[2:,] - all the rows from third row and all columns

      d1.iloc[:,1] - second column and all rows

#### <font color='yellowgreen'> Conditional (usage of "&", "|" (and , or))
      
      df.loc[(df['var'] >= ) & (df['var'] < ) | (df['var'] == )]