# **Pandas Objects Tutorial**

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.



In [None]:
import numpy as np
import pandas as pd

**Object Creation:**


*   Series.
*   DataFrame.



**What's a Series?**

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

>>s = pd.Series(data, index=index)

Here, data can be many different things:

*   a Python dict
*   an ndarray (the n-dimensional array object defined in NumPy, which stores a collection of elements of the same type)
*   a scalar value

The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is:

**From ndarray**

If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].

In [None]:
a = [4, 2, 3, 4]  # a plain list; pandas converts it the same way as an ndarray
s = pd.Series(a)
print(s)
print(type(s))

0    4
1    2
2    3
3    4
dtype: int64
<class 'pandas.core.series.Series'>
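
The cell above passes a plain Python list; the same rules apply to a real ndarray. The following is a minimal sketch with made-up values and labels (the names `values`, `s_default`, and `s_labeled` are chosen only for illustration). It shows the default index, an explicit index, and what happens when the index length does not match the data:

```python
import numpy as np
import pandas as pd

values = np.array([4, 2, 3, 4])  # illustrative data, not from the tutorial's data sets

# No index passed: pandas builds the default 0..len(data)-1 index.
s_default = pd.Series(values)

# Explicit index: it must have exactly as many labels as there are values.
s_labeled = pd.Series(values, index=["w", "x", "y", "z"])

# An index of the wrong length raises a ValueError.
try:
    pd.Series(values, index=["w", "x"])
    length_error = None
except ValueError as exc:
    length_error = exc

print(s_default.index.tolist())
print(s_labeled.index.tolist())
print(type(length_error).__name__)
```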


*Notice* in the code below that the list a holds items of different types, so the dtype of s is object.

A data type object (an instance of the numpy.dtype class) describes how the bytes in the fixed-size block of memory corresponding to an array item should be interpreted, including the type of the data (integer, float, Python object, etc.).

In [None]:
a=[1,6,8,"mira"]
s= pd.Series(a)
print(s)
print (type(s))

0       1
1       6
2       8
3    mira
dtype: object
<class 'pandas.core.series.Series'>


In [None]:
s= pd.Series(np.random.randn(5), index=["a", "b", "c", "e",'f'])

s

a   -1.833934
b    0.253784
c   -0.138942
e   -0.123098
f   -0.191769
dtype: float64

In [None]:
s.index

Index(['a', 'b', 'c', 'e', 'f'], dtype='object')

Note:
pandas supports non-unique index values. If an operation that does not support duplicate index values is attempted, an exception will be raised at that time.
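
To make the note concrete, here is a small sketch using a hypothetical Series `dup` whose label "e" appears twice. Selecting by label still works, while reindexing is one example of an operation that needs unique labels and therefore raises:

```python
import pandas as pd

# Hypothetical Series with the label "e" used twice.
dup = pd.Series([1.0, 2.0, 3.0], index=["a", "e", "e"])

# Selecting a duplicated label returns every matching row as a Series.
both_e = dup["e"]

# Reindexing needs unique labels, so pandas raises only at this point.
try:
    dup.reindex(["a", "e"])
    reindex_error = None
except ValueError as exc:
    reindex_error = exc

print(dup.index.is_unique)
print(len(both_e))
print(type(reindex_error).__name__)
```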

**Series initialized from dicts:**

In [None]:
d = {"b": 1, "a": 0, "c": 2}

pd.Series(d)

b    1
a    0
c    2
dtype: int64

In [None]:
pd.Series(d, index=["b", "c", "d", "a"])

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

**Series initialized from a scalar value:**

In [None]:
pd.Series(1, index=["a", "b", "c", "d", "e"])

a    1
b    1
c    1
d    1
e    1
dtype: int64

**Series is ndarray-like**

Series acts very similarly to an ndarray and is a valid argument to most NumPy functions. However, operations such as slicing will also slice the index.

In [None]:
s.iloc[0]  # positional access; plain s[0] is deprecated on a label-indexed Series


-0.9170768665264409

In [None]:
s[:3]

a   -0.917077
b    0.561694
c   -0.743124
dtype: float64

In [None]:
s[s > s.median()]

b    0.561694
e   -0.101201
dtype: float64

In [None]:
s.iloc[[4, 3, 1]]  # select by position; integer keys in plain [] are deprecated here

f   -0.915944
e    1.183716
b    0.944478
dtype: float64

**Vectorized operations and label alignment with Series**

When working with raw NumPy arrays, looping through value-by-value is usually not necessary. The same is true when working with Series in pandas. Series can also be passed into most NumPy methods expecting an ndarray.

In [None]:
s + s

a   -3.667868
b    0.507568
c   -0.277883
e   -0.246196
f   -0.383539
dtype: float64

In [None]:
s * 2

a   -1.834154
b    1.123388
c   -1.486248
e   -0.202402
f   -1.585535
dtype: float64

The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result will be marked as missing NaN. Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data.

In [None]:
s[1:] + s[:-1]

a         NaN
b    1.888956
c   -1.436390
e    2.367432
f         NaN
dtype: float64
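
Because the values above are random, the alignment rule is easier to verify on fixed numbers. The sketch below uses two hypothetical Series, `left` and `right`, that share only the labels "b" and "c". It also shows `Series.add` with `fill_value`, one way to avoid NaN when a label is missing on one side:

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
right = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Plain "+" aligns on labels; labels present on only one side become NaN.
total = left + right

# Series.add with fill_value treats a label missing on one side as 0 instead.
total_filled = left.add(right, fill_value=0)

print(total)
print(total_filled)
```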

In [None]:
s.iloc[-1]  # last element by position
s

a   -0.917077
b    0.561694
c   -0.743124
e   -0.101201
f   -0.792767
dtype: float64

**What's a DataFrame?**

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

*   Dict of 1D ndarrays, lists, dicts, or Series
*   2-D numpy.ndarray
*   Structured or record ndarray
*   A Series
*   Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules.

In [None]:
d = {
    "one": pd.Series([1.0, 2.0, 3.0,5.0], index=["a", "b", "c","d"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c","e"]),
}
df = pd.DataFrame(d)

df

Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,5.0,
e,,4.0


In [None]:
pd.DataFrame(d, index=["d", "b", "a"])

Unnamed: 0,one,two
d,5.0,
b,2.0,2.0
a,1.0,1.0


In [None]:
pd.DataFrame(d, index=["d", "b", "a"], columns=["two"])

Unnamed: 0,two
d,
b,2.0
a,1.0


In [None]:
df.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [None]:
df.columns

Index(['one', 'two'], dtype='object')

**Column selection, addition, deletion**

You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations:

In [None]:
df["one"]


a    1.0
b    2.0
c    3.0
d    5.0
e    NaN
Name: one, dtype: float64

In [None]:
df["three"] = df["one"] * df["two"]
df["three"]

a    1.0
b    4.0
c    9.0
d    NaN
e    NaN
Name: three, dtype: float64

In [None]:
df.columns

Index(['one', 'two', 'three'], dtype='object')

In [None]:
df["flag"] = df["one"] > 2
df

Unnamed: 0,one,two,three,flag
a,1.0,1.0,1.0,False
b,2.0,2.0,4.0,False
c,3.0,3.0,9.0,True
d,5.0,,,True
e,,4.0,,False


In [None]:
del df["two"]

df

Unnamed: 0,one,three,flag
a,1.0,1.0,False
b,2.0,4.0,False
c,3.0,9.0,True
d,5.0,,True
e,,,False


When inserting a scalar value, it will naturally be propagated to fill the column:

In [None]:
df["foo"] = "bar"
df

Unnamed: 0,one,three,flag,foo
a,1.0,1.0,False,bar
b,2.0,4.0,False,bar
c,3.0,9.0,True,bar
d,5.0,,True,bar
e,,,False,bar


When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s index:

In [None]:
df["one_trunc"] = df["one"][:2]

df

Unnamed: 0,one,three,flag,foo,one_trunc
a,1.0,1.0,False,bar,1.0
b,2.0,4.0,False,bar,2.0
c,3.0,9.0,True,bar,
d,5.0,,True,bar,
e,,,False,bar,


Creating a DataFrame by passing a NumPy array, with a datetime index using date_range() and labeled columns:

In [None]:
dates = pd.date_range("20130101", periods=6)

dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [None]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

df

Unnamed: 0,A,B,C,D
2013-01-01,-1.336949,0.829048,0.342764,-0.072664
2013-01-02,1.071194,-0.660062,-1.469449,0.287944
2013-01-03,-0.136188,0.02908,-0.093194,-0.434824
2013-01-04,-0.272667,1.304666,-2.531193,0.380609
2013-01-05,0.569271,1.310273,-0.823701,-1.066897
2013-01-06,0.051338,0.066398,0.949704,-0.313183


Creating a DataFrame by passing a dictionary of objects that can be converted into a series-like structure:

In [None]:
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)


    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


**Viewing the Data**

One of the most used methods for getting a quick overview of a DataFrame is the head() method.

The head() method returns the headers and a specified number of rows, starting from the top.

In [None]:
from google.colab import files
upload = files.upload()

Saving titanic.csv to titanic (1).csv


In [None]:
df = pd.read_csv('titanic.csv')  # the file uploaded in the previous cell
df.head(4)  # pass any row count you need, e.g. df.head(30) or df.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S


In [None]:
df.tail(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
888,889,0.0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,29.655654,1,2,W./C. 6607,23.45,S
889,890,1.0,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C
890,891,0.0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,Q


In [None]:
df.shape

(891, 12)

In [None]:
type(df['Age'][1])

numpy.float64

In [None]:
df["Pclass"].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [None]:
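df = pd.read_csv('loan_data_set.csv')  # switch to the loan data set used in the rest of the tutorial
df["Loan_Amount_Term"].value_counts()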

360.0    512
180.0     44
480.0     15
300.0     13
240.0      4
84.0       4
120.0      3
60.0       2
36.0       2
12.0       1
Name: Loan_Amount_Term, dtype: int64

In [None]:
df["Education"].unique()

array(['Graduate', 'Not Graduate'], dtype=object)

In [None]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(7)
memory usage: 57.7+ KB
None


----------------------------------------------------------------------

**Pandas - Cleaning Data**

Data cleaning means fixing bad data in your data set.

Bad data could be:

*   Empty cells
*   Wrong data
*   Data in wrong format
*   Duplicates


**Empty Cells**

Empty cells can potentially give you a wrong result when you analyze data.

**Remove Rows**
One way to deal with empty cells is to remove rows that contain empty cells.

This is usually OK, since data sets can be very big, and removing a few rows will not have a big impact on the result.

In [None]:
#  Check missing values
# ------------------------
df.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Loan_Status           0
dtype: int64

In [None]:
len(df)

614

In [None]:
df = pd.read_csv('loan_data_set.csv')
#df.dropna(inplace=True)                                   # option_1
#df.dropna(subset=['Married'], inplace=True)               # option_2
#df.drop('Credit_History', axis=1, inplace=True)           # option_3
#df.fillna(0, inplace=True)                                # option_4
#df['Dependents'] = df['Dependents'].fillna(0)             # option_5

#median = df["Loan_Amount_Term"].median()
#df["Loan_Amount_Term"] = df["Loan_Amount_Term"].fillna(median)   # option_6

df.head(3)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Y


**Delete any row that has an empty value in any column**
Note: By default, the dropna() method returns a new DataFrame, and will not change the original.

If you want to change the original DataFrame, use the inplace = True argument:
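
A minimal sketch of the difference, using a hypothetical three-row frame `toy` in place of the loan data (the column names are borrowed from it, the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame standing in for the loan data.
toy = pd.DataFrame({"Married": ["Yes", None, "No"],
                    "LoanAmount": [128.0, 66.0, np.nan]})

cleaned = toy.dropna()      # returns a new DataFrame; toy is unchanged
rows_before = len(toy)      # still 3

toy.dropna(inplace=True)    # modifies toy itself and returns None
rows_after = len(toy)       # only the fully populated row remains

print(len(cleaned), rows_before, rows_after)
```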

**Delete any row that has an empty value in a specific column**
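
This corresponds to option_2 in the cell above. As a sketch on a hypothetical frame `toy`, passing `subset` restricts the check to the listed columns, so empty cells elsewhere do not cause a row to be dropped:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one empty 'Married' cell and one empty 'LoanAmount' cell.
toy = pd.DataFrame({"Married": ["Yes", None, "No"],
                    "LoanAmount": [128.0, 66.0, np.nan]})

# Only the row with an empty 'Married' cell is removed;
# the row with NaN in 'LoanAmount' is kept.
married_known = toy.dropna(subset=["Married"])

print(married_known)
```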

**Delete a specific column**
Use axis=1, which refers to the column axis.
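
This corresponds to option_3 in the cell above. A short sketch on a hypothetical two-column frame; the `columns=` keyword is an equivalent spelling that some find easier to read than `axis=1`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with made-up IDs.
toy = pd.DataFrame({"Loan_ID": ["LP1", "LP2"],
                    "Credit_History": [1.0, np.nan]})

without_history = toy.drop("Credit_History", axis=1)  # axis=1 means columns
same_result = toy.drop(columns="Credit_History")      # equivalent keyword form

print(without_history.columns.tolist())
```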

**Replace Empty Values**
Another way of dealing with empty cells is to insert a new value instead.

This way you do not have to delete entire rows just because of some empty cells.

The fillna() method allows us to replace empty cells with a value:
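
This corresponds to option_4 in the cell above. A sketch on a hypothetical numeric frame, where every empty cell in every column is replaced by 0:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one empty cell per column.
toy = pd.DataFrame({"LoanAmount": [128.0, np.nan, 66.0],
                    "Loan_Amount_Term": [360.0, 180.0, np.nan]})

filled = toy.fillna(0)  # returns a new DataFrame; toy keeps its NaN values

print(filled)
```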

**Replace Only For Specified Columns**
The example above replaces all empty cells in the whole DataFrame.

To only replace empty values for one column, specify the column name for the DataFrame:
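
A sketch on a hypothetical frame. It assigns the filled column back instead of calling `fillna(..., inplace=True)` on the selected column, because recent pandas versions warn that the in-place call on a column selection may not update the original DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one empty cell in each column.
toy = pd.DataFrame({"LoanAmount": [128.0, np.nan],
                    "Loan_Amount_Term": [np.nan, 360.0]})

# Fill only 'LoanAmount'; 'Loan_Amount_Term' keeps its empty cell.
toy["LoanAmount"] = toy["LoanAmount"].fillna(0)

print(toy)
```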

**Replace Using Mean, Median, or Mode**
A common way to replace empty cells is to calculate the mean, median, or mode value of the column.

Pandas provides the mean(), median(), and mode() methods to calculate the respective values for a specified column:
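
A sketch on a hypothetical Series `term` with made-up values. Note that mode() returns a Series, because several values can be tied for most frequent, so the first entry is taken here:

```python
import numpy as np
import pandas as pd

# Hypothetical loan terms with one empty cell.
term = pd.Series([360.0, 360.0, 180.0, np.nan, 480.0])

mean_value = term.mean()      # NaN is skipped: (360 + 360 + 180 + 480) / 4
median_value = term.median()  # middle of the sorted non-empty values
mode_value = term.mode()[0]   # mode() returns a Series; take its first entry

filled = term.fillna(median_value)

print(mean_value, median_value, mode_value)
print(filled.tolist())
```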

-------------------------------------------------------------------------------

**Pandas - Cleaning Data of Wrong format**

Data of Wrong Format
Cells with data of wrong format can make it difficult, or even impossible, to analyze data.

To fix it, you have two options: remove the rows, or convert all cells in the columns into the same format.

Convert Into a Correct Format
In our DataFrame, the 'Dependents' column holds counts but is stored as text (dtype object), and some cells contain '3+', which cannot be converted to a number directly:

In [None]:
df = pd.read_csv('loan_data_set.csv')
df.head(10)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Y
5,LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Y
6,LP001013,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Y
7,LP001014,Male,Yes,3+,Graduate,No,3036,2504.0,158.0,360.0,0.0,N
8,LP001018,Male,Yes,2,Graduate,No,4006,1526.0,168.0,360.0,1.0,Y
9,LP001020,Male,Yes,1,Graduate,No,12841,10968.0,349.0,360.0,1.0,N


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(7)
memory usage: 57.7+ KB


In [None]:
df['Dependents'].unique()

array(['0', '1', '2', '3+', nan], dtype=object)

**Wrong Data**

"Wrong data" does not have to be "empty cells" or "wrong format", it can just be wrong, like if someone registered "199" instead of "1.99".

Sometimes you can spot wrong data by looking at the data set, because you have an expectation of what it should be.

If you take a look at our data set, you can see the value 3+ in the 'Dependents' column.

**Replacing Values**

One way to fix wrong values is to replace them with something else.

In our example, row 7 holds '3+' in the 'Dependents' column, and we can simply set it to 3 in that row:

In [None]:
df.loc[7, 'Dependents'] = 3
df.head(10)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Y
5,LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Y
6,LP001013,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Y
7,LP001014,Male,Yes,3,Graduate,No,3036,2504.0,158.0,360.0,0.0,N
8,LP001018,Male,Yes,2,Graduate,No,4006,1526.0,168.0,360.0,1.0,Y
9,LP001020,Male,Yes,1,Graduate,No,12841,10968.0,349.0,360.0,1.0,N


In [None]:
df['Dependents'][7] = 4  # chained indexing triggers the warning below; prefer df.loc[7, 'Dependents'] = 4
df.head(10)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Dependents'][7]=4


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Y
5,LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Y
6,LP001013,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Y
7,LP001014,Male,Yes,4,Graduate,No,3036,2504.0,158.0,360.0,0.0,N
8,LP001018,Male,Yes,2,Graduate,No,4006,1526.0,168.0,360.0,1.0,Y
9,LP001020,Male,Yes,1,Graduate,No,12841,10968.0,349.0,360.0,1.0,N


In [None]:
df = pd.read_csv('loan_data_set.csv')
df.head(10)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Y
5,LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Y
6,LP001013,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Y
7,LP001014,Male,Yes,3+,Graduate,No,3036,2504.0,158.0,360.0,0.0,N
8,LP001018,Male,Yes,2,Graduate,No,4006,1526.0,168.0,360.0,1.0,Y
9,LP001020,Male,Yes,1,Graduate,No,12841,10968.0,349.0,360.0,1.0,N


In [None]:
for x in df.index:
    if df.loc[x, "Dependents"] == '3+':
        df.loc[x, "Dependents"] = '3'

df.head(10)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Y
5,LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Y
6,LP001013,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Y
7,LP001014,Male,Yes,3,Graduate,No,3036,2504.0,158.0,360.0,0.0,N
8,LP001018,Male,Yes,2,Graduate,No,4006,1526.0,168.0,360.0,1.0,Y
9,LP001020,Male,Yes,1,Graduate,No,12841,10968.0,349.0,360.0,1.0,N
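
The loop above visits every row. As a sketch of a vectorized alternative, `Series.replace` swaps every exact match in one call; the Series `dependents` below is hypothetical and only mimics the column:

```python
import pandas as pd

# Hypothetical values mimicking the 'Dependents' column.
dependents = pd.Series(["0", "1", "3+", "2", "3+"])

# Replace every cell equal to '3+' with '3' in one call.
fixed = dependents.replace("3+", "3")

print(fixed.tolist())
```

On the loan data the same idea would read `df["Dependents"] = df["Dependents"].replace("3+", "3")`.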


In [None]:
df = pd.read_csv('loan_data_set.csv')

for x in df.index:
    if df.loc[x, "Dependents"] == '3+':
        df.drop(x, inplace = True)

In [None]:
len(df)

563
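
Dropping rows one at a time in a loop works, but a boolean mask expresses the same filter in a single step. A sketch on a hypothetical frame `toy` with made-up IDs:

```python
import pandas as pd

# Hypothetical frame mimicking two columns of the loan data.
toy = pd.DataFrame({"Loan_ID": ["LP1", "LP2", "LP3"],
                    "Dependents": ["0", "3+", "1"]})

# Keep only the rows whose 'Dependents' value is not '3+'.
kept = toy[toy["Dependents"] != "3+"]

print(kept)
```

As with drop(), the surviving rows keep their original index labels, so the index has gaps afterwards.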

In [None]:
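df = pd.read_csv('loan_data_set.csv')  # reload: all 614 rows, before any rows were dropped
df['Dependents'] = df['Dependents'].fillna('0').replace('3+', '3').astype(str).astype(int)  # fill empty cells and fix '3+' first, otherwise astype(int) fails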

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         614 non-null    int32  
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Loan_Status        614 non-null    object 
dtypes: float64(4), int32(1), int64(1), object(6)
memory usage: 55.3+ KB
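
As an alternative to cleaning the strings before calling astype(), `pd.to_numeric` with `errors="coerce"` converts whatever it can parse and turns the rest into NaN. This is only a sketch on hypothetical values; whether NaN is an acceptable stand-in for '3+' depends on the analysis:

```python
import pandas as pd

# Hypothetical values mimicking the raw 'Dependents' column.
raw = pd.Series(["0", "1", "3+", None], dtype=object)

# Unparseable entries such as '3+' become NaN instead of raising an error.
as_numbers = pd.to_numeric(raw, errors="coerce")

print(as_numbers)
```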


---------------------------------------------------------------------------------

**Removing Duplicates**

Duplicate rows are rows that have been registered more than one time.

To discover duplicates, we can use the duplicated() method.

The duplicated() method returns a Boolean value for each row:

In [None]:
print(df.duplicated())

0      False
1      False
2      False
3      False
4      False
       ...  
608    False
609    False
611    False
612    False
613    False
Length: 563, dtype: bool
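
The loan data shows no duplicates here, so the sketch below uses a hypothetical frame `toy` in which the third row repeats the first. It shows how duplicated() flags the repeat and how drop_duplicates() removes it:

```python
import pandas as pd

# Hypothetical frame: row 2 is an exact copy of row 0.
toy = pd.DataFrame({"Loan_ID": ["LP1", "LP2", "LP1"],
                    "LoanAmount": [128.0, 66.0, 128.0]})

flags = toy.duplicated()         # True only for the repeated row
deduped = toy.drop_duplicates()  # keeps the first occurrence of each row

print(flags.tolist())
print(len(deduped))
```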


**Pandas - Data Correlations**

In [None]:
df.corr(numeric_only=True)  # newer pandas versions need numeric_only=True to skip the text columns

Unnamed: 0,Dependents,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
Dependents,1.0,0.118202,0.03043,0.166106,-0.102028,-0.038702
ApplicantIncome,0.118202,1.0,-0.116605,0.570909,-0.045306,-0.014715
CoapplicantIncome,0.03043,-0.116605,1.0,0.188619,-0.059878,-0.002056
LoanAmount,0.166106,0.570909,0.188619,1.0,0.039447,-0.008433
Loan_Amount_Term,-0.102028,-0.045306,-0.059878,0.039447,1.0,0.00147
Credit_History,-0.038702,-0.014715,-0.002056,-0.008433,0.00147,1.0
