# How to install pandas?
!pip install pandas


In [1]:
# You can import pandas
import pandas as pd

# 2. Data Structures
Introduction to Data Structures

Pandas has two data structures as follows:
1. A Series is a one-dimensional labeled array that can hold data of any type (integer, string, boolean, float, Python objects, and so on). Its axis labels are collectively called the index.
2. A DataFrame is a two-dimensional labeled data structure with columns. Its columns can hold different data types.
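
As a quick illustration of the two structures, the sketch below builds one of each and inspects its dimensions. The variable names and values are made up for this example.

```python
import pandas as pd

# A Series: one-dimensional, one label per value
ages = pd.Series([25, 32, 47], index=["Ali", "Sara", "Omar"])

# A DataFrame: two-dimensional, columns may have different data types
people = pd.DataFrame({"name": ["Ali", "Sara", "Omar"], "age": [25, 32, 47]})

print(ages.ndim)      # number of dimensions of the Series
print(people.ndim)    # number of dimensions of the DataFrame
print(people.dtypes)  # one data type per column
```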

# 3. Pandas Series
Introduction to Pandas Series and Creating Series

A pandas Series is a one-dimensional labeled array capable of holding any data type. A Series has a single dtype for all of its values, similar to an array or a column in a table; when the values are of mixed types, they are stored with the generic object dtype.

Pandas assigns an index label to each item in the Series. By default, the labels run from 0 to N, where N is the length of the Series minus one.
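
A small sketch of the default labels, assuming a made-up Series of four strings: the index runs from 0 to 3, which is the length minus one.

```python
import pandas as pd

letters = pd.Series(["a", "b", "c", "d"])

print(letters.index)        # RangeIndex(start=0, stop=4, step=1)
print(list(letters.index))  # [0, 1, 2, 3]
```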

Creating a Series
1. To create a numeric series

In [2]:
# Create a numeric series
numbers = range(1, 100 , 5)
pd.Series(numbers)

0      1
1      6
2     11
3     16
4     21
5     26
6     31
7     36
8     41
9     46
10    51
11    56
12    61
13    66
14    71
15    76
16    81
17    86
18    91
19    96
dtype: int64

In pandas, the row labels are called the index.

2. To create an object Series

In [6]:
# Create an object series
string = "HI" , "How", "are"
pd.Series(string)

0     HI
1    How
2    are
dtype: object

3. To create a series by giving both numeric and string values

In [7]:
# Create a Series with an arbitrary list
s = pd.Series([345, 'Lohore',34.6, 'Happay Birthday'])
s

0                345
1             Lohore
2               34.6
3    Happay Birthday
dtype: object

4. To set index values for a series

In [8]:
marks = [60, 89, 74, 86]
subject = ["Math", "Science", "English", "Urdu"]
pd.Series(marks, index=subject)

Math       60
Science    89
English    74
Urdu       86
dtype: int64

5. To create a series from a dictionary

In [12]:
data = {'Math':60, 'Science':69, 'English':79, 'Urdu':89}
pd.Series(data)

Math       60
Science    69
English    79
Urdu       89
dtype: int64

6. A series with missing Values


If we pass an index label that is not a key in the dictionary, its value will be NaN.

In [14]:
subjects = ["Math", "Science", "English" ,"Urdu","Computer"]
marks_series = pd.Series(data,  index = subjects)
print(marks_series)

Math        60.0
Science     69.0
English     79.0
Urdu        89.0
Computer     NaN
dtype: float64


# 3.2 Manipulating Series
Manipulating Series

1. To check for null values using .isnull

In [15]:
marks_series.isnull()

Math        False
Science     False
English     False
Urdu        False
Computer     True
dtype: bool

2. To check for null values using .notnull

In [16]:
marks_series.notnull()

Math         True
Science      True
English      True
Urdu         True
Computer    False
dtype: bool

3. To find the subjects in which the marks are more than 75

In [17]:
marks_series[marks_series > 75]

English    79.0
Urdu       89.0
dtype: float64

4. To assign 68 marks to 'Computer'

In [18]:
marks_series["Computer"] = 68
marks_series

Math        60.0
Science     69.0
English     79.0
Urdu        89.0
Computer    68.0
dtype: float64

5. To check whether the Math marks are 73

In [19]:
marks_series.Math == 73

False

In [20]:
# Or you may use

marks_series["Math"] == 73

False

6. Sorting a numeric series

In [32]:
# Create a pandas series
import numpy as np
values = pd.Series([23, 45 , np.nan , 41 , 34, 56, np.nan, 20, 14,56])
values

0    23.0
1    45.0
2     NaN
3    41.0
4    34.0
5    56.0
6     NaN
7    20.0
8    14.0
9    56.0
dtype: float64

# Ascending order

In [25]:
values.sort_values(ascending = True)

8    14.0
7    20.0
0    23.0
4    34.0
3    41.0
1    45.0
5    56.0
9    56.0
2     NaN
6     NaN
dtype: float64

# Descending order

In [26]:
values.sort_values(ascending = False)

5    56.0
9    56.0
1    45.0
3    41.0
4    34.0
0    23.0
7    20.0
8    14.0
2     NaN
6     NaN
dtype: float64
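
Both outputs above place the NaN values last. A minimal sketch of how that placement can be changed, using the `na_position` parameter of `sort_values` on a shortened, made-up Series:

```python
import numpy as np
import pandas as pd

values = pd.Series([23, 45, np.nan, 41])

# NaN values go last by default, whether ascending or descending
descending = values.sort_values(ascending=False)
print(descending)

# na_position="first" moves them to the top instead
nan_first = values.sort_values(na_position="first")
print(nan_first)
```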

7. Sorting a series of strings

In [27]:
# Create a pandas series
string_values = pd.Series(["a" , "j" , "d" , "a"])

string_values

0    a
1    j
2    d
3    a
dtype: object

In [28]:
# Ascending order
string_values.sort_values(ascending = True )

0    a
3    a
2    d
1    j
dtype: object

In [29]:
# Descending order
string_values.sort_values(ascending=False)

1    j
2    d
0    a
3    a
dtype: object

8. Rank a Series

In [30]:
# Recall marks_series
marks_series.rank( ascending=True, pct= False)

Math        1.0
Science     3.0
English     4.0
Urdu        5.0
Computer    2.0
dtype: float64

# 4. Pandas DataFrames

Introduction to Dataframes and Creating Dataframes

A DataFrame is a tabular representation of data containing an ordered collection of columns, each of which can be a different type (numeric, string, boolean, and so on).

The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index. In a data frame, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays.

While a DataFrame is physically two-dimensional, it can be used to represent higher-dimensional data in a tabular format using hierarchical indexing.

4.1 Creating DataFrames

1. Creating a data frame from a dictionary

In [35]:
data = {'subject': ['Math', 'History', 'Urdu', 'English' , 'Art', 'Computer'],
        'Marks': (45, 55, 78, 94, 78, 76),
        'CGPA': [2.6, 3.8, 3.6, 4.0, 3.1, 3.7]}

df = pd.DataFrame(data)
print(df)

    subject  Marks  CGPA
0      Math     45   2.6
1   History     55   3.8
2      Urdu     78   3.6
3   English     94   4.0
4       Art     78   3.1
5  Computer     76   3.7


Note: As with a Series, the resulting DataFrame is assigned an index automatically. The 'Marks' values are given as a tuple, which works just like a list here.

2. To create a data frame from Series

In [38]:
Subject = pd.Series(['Math', 'History', 'Urdu', 'English' , 'Art', 'Computer'])
Marks = pd.Series ([45, 55, 78, 94, 78, 76])
CGPA = pd.Series ([2.6, 3.8, 3.6, 4.0, 3.1, 3.7])


In [39]:
pd.DataFrame([Subject, Marks , CGPA], index = ['Subject', 'Marks', 'CGPA'])

Unnamed: 0,0,1,2,3,4,5
Subject,Math,History,Urdu,English,Art,Computer
Marks,45,55,78,94,78,76
CGPA,2.6,3.8,3.6,4.0,3.1,3.7


However, we want a vertical data frame, so we use .T, which stands for transpose.

In [40]:
pd.DataFrame([Subject, Marks , CGPA], index = ['Subject', 'Marks', 'CGPA']).T


Unnamed: 0,Subject,Marks,CGPA
0,Math,45,2.6
1,History,55,3.8
2,Urdu,78,3.6
3,English,94,4.0
4,Art,78,3.1
5,Computer,76,3.7


Remark: You can also assign the data frame to a variable first and then use .T to transpose it.
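
The remark can be sketched as follows, with shortened, made-up data. The names `wide`, `tall`, and `by_column` are chosen for this example only.

```python
import pandas as pd

subject = pd.Series(["Math", "History", "Urdu"])
marks = pd.Series([45, 55, 78])

# Each Series becomes one row of the data frame
wide = pd.DataFrame([subject, marks], index=["Subject", "Marks"])

# .T swaps rows and columns, so each Series becomes a column
tall = wide.T
print(tall)

# Alternative: a dictionary of Series builds the columns directly
by_column = pd.DataFrame({"Subject": subject, "Marks": marks})
print(by_column.dtypes)
```

Passing a dictionary of Series avoids the transpose and keeps the numeric dtype of the 'Marks' column.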

4. To create a data frame from lists

In [41]:
# Plain Python lists work the same way as Series here
Subject = ['Math', 'History', 'Urdu', 'English', 'Art', 'Computer']
Marks = [45, 55, 78, 94, 78, 76]
CGPA = [2.6, 3.8, 3.6, 4.0, 3.1, 3.7]

In [42]:
pd.DataFrame([Subject, Marks , CGPA], index = ['Subject', 'Marks', 'CGPA']).T


Unnamed: 0,Subject,Marks,CGPA
0,Math,45,2.6
1,History,55,3.8
2,Urdu,78,3.6
3,English,94,4.0
4,Art,78,3.1
5,Computer,76,3.7


5. To read data from csv file

In [6]:

import pandas as pd


In [8]:
data = pd.read_csv(r'C:/Users\Rao Hammad Raza/OneDrive/Pictures/Python ka chilla vs/USA_Housing (1).csv')


On printing the data, we can see that it has been read as a pandas data frame.

In [9]:
print(data)

      Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  \
0         79545.458574             5.682861                   7.009188   
1         79248.642455             6.002900                   6.730821   
2         61287.067179             5.865890                   8.512727   
3         63345.240046             7.188236                   5.586729   
4         59982.197226             5.040555                   7.839388   
...                ...                  ...                        ...   
4995      60567.944140             7.830362                   6.137356   
4996      78491.275435             6.999135                   6.576763   
4997      63390.686886             7.250591                   4.805081   
4998      68001.331235             5.534388                   7.130144   
4999      65510.581804             5.992305                   6.792336   

      Avg. Area Number of Bedrooms  Area Population         Price  \
0                             4.09     230

6. To print head of the data

In [10]:
data.head()

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1059034.0,"208 Michael Ferry Apt. 674\nLaurabury, NE 3701..."
1,79248.642455,6.0029,6.730821,3.09,40173.072174,1505891.0,"188 Johnson Views Suite 079\nLake Kathleen, CA..."
2,61287.067179,5.86589,8.512727,5.13,36882.1594,1058988.0,"9127 Elizabeth Stravenue\nDanieltown, WI 06482..."
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1260617.0,USS Barnett\nFPO AP 44820
4,59982.197226,5.040555,7.839388,4.23,26354.109472,630943.5,USNS Raymond\nFPO AE 09386


By default, .head() displays the first five rows. However, we can set the desired number of rows to be displayed.

In [12]:
data.head(9)

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1059034.0,"208 Michael Ferry Apt. 674\nLaurabury, NE 3701..."
1,79248.642455,6.0029,6.730821,3.09,40173.072174,1505891.0,"188 Johnson Views Suite 079\nLake Kathleen, CA..."
2,61287.067179,5.86589,8.512727,5.13,36882.1594,1058988.0,"9127 Elizabeth Stravenue\nDanieltown, WI 06482..."
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1260617.0,USS Barnett\nFPO AP 44820
4,59982.197226,5.040555,7.839388,4.23,26354.109472,630943.5,USNS Raymond\nFPO AE 09386
5,80175.754159,4.988408,6.104512,4.04,26748.428425,1068138.0,"06039 Jennifer Islands Apt. 443\nTracyport, KS..."
6,64698.463428,6.025336,8.14776,3.41,60828.249085,1502056.0,"4759 Daniel Shoals Suite 442\nNguyenburgh, CO ..."
7,78394.339278,6.98978,6.620478,2.42,36516.358972,1573937.0,"972 Joyce Viaduct\nLake William, TN 17778-6483"
8,59927.660813,5.362126,6.393121,2.3,29387.396003,798869.5,USS Gilbert\nFPO AA 20957
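
A self-contained sketch of the row counts returned by `head()` and `tail()`, using a made-up data frame of 20 rows instead of the housing file:

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(20)})

print(len(numbers.head()))    # 5 rows by default
print(len(numbers.head(9)))   # 9 rows when requested
print(len(numbers.tail()))    # tail() also defaults to 5 rows
```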


7. To print tail of the data

In [13]:
data.tail()

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address
4995,60567.94414,7.830362,6.137356,3.46,22837.361035,1060194.0,USNS Williams\nFPO AP 30153-7653
4996,78491.275435,6.999135,6.576763,4.02,25616.115489,1482618.0,"PSC 9258, Box 8489\nAPO AA 42991-3352"
4997,63390.686886,7.250591,4.805081,2.13,33266.14549,1030730.0,"4215 Tracy Garden Suite 076\nJoshualand, VA 01..."
4998,68001.331235,5.534388,7.130144,5.44,42625.620156,1198657.0,USS Wallace\nFPO AE 73316
4999,65510.581804,5.992305,6.792336,4.07,46501.283803,1298950.0,"37778 George Ridges Apt. 509\nEast Holly, NV 2..."


8. To obtain the dimension of the data

In [14]:
data.shape

(5000, 7)

9. To know the data types of a data frame

In [15]:
data.dtypes

Avg. Area Income                float64
Avg. Area House Age             float64
Avg. Area Number of Rooms       float64
Avg. Area Number of Bedrooms    float64
Area Population                 float64
Price                           float64
Address                          object
dtype: object

10. To know some information of the data

In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Avg. Area Income              5000 non-null   float64
 1   Avg. Area House Age           5000 non-null   float64
 2   Avg. Area Number of Rooms     5000 non-null   float64
 3   Avg. Area Number of Bedrooms  5000 non-null   float64
 4   Area Population               5000 non-null   float64
 5   Price                         5000 non-null   float64
 6   Address                       5000 non-null   object 
dtypes: float64(6), object(1)
memory usage: 273.6+ KB


11. To check the data type of column in the data frame

In [22]:
type(data['Area Population'])

pandas.core.series.Series

# 5. Manipulating DataFrames

Manipulating the Dataframes

Add new column and rows

# Caution:
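1. DataFrame[column] works for any column name, but DataFrame.column only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.
2. New columns cannot be created with attribute syntax such as data.BMI = ...; use data["BMI"] = ... instead.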
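
Both cautions can be checked on a small example. The data frame and its column names below are hypothetical; the warning that pandas emits for the attribute assignment is silenced so the output stays short.

```python
import warnings
import pandas as pd

df = pd.DataFrame({"weight": [60, 72], "body height": [1.6, 1.8]})

print(df.weight)          # works: 'weight' is a valid Python name
print(df["body height"])  # names with spaces need brackets

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas warns about the next line
    df.BMI = [23.4, 22.2]            # sets a plain attribute, not a column

column_after_attribute = "BMI" in df.columns
print(column_after_attribute)        # False

df["BMI"] = df["weight"] / df["body height"] ** 2  # creates the column
column_after_brackets = "BMI" in df.columns
print(column_after_brackets)         # True
```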

1. Adding a new column to the data set

In [28]:
data["Total"] = data["Area Population"] / data["Price"] **2

In [29]:
data

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address,Total
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1.059034e+06,"208 Michael Ferry Apt. 674\nLaurabury, NE 3701...",2.058469e-08
1,79248.642455,6.002900,6.730821,3.09,40173.072174,1.505891e+06,"188 Johnson Views Suite 079\nLake Kathleen, CA...",1.771528e-08
2,61287.067179,5.865890,8.512727,5.13,36882.159400,1.058988e+06,"9127 Elizabeth Stravenue\nDanieltown, WI 06482...",3.288776e-08
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1.260617e+06,USS Barnett\nFPO AP 44820,2.159025e-08
4,59982.197226,5.040555,7.839388,4.23,26354.109472,6.309435e+05,USNS Raymond\nFPO AE 09386,6.620144e-08
...,...,...,...,...,...,...,...,...
4995,60567.944140,7.830362,6.137356,3.46,22837.361035,1.060194e+06,USNS Williams\nFPO AP 30153-7653,2.031774e-08
4996,78491.275435,6.999135,6.576763,4.02,25616.115489,1.482618e+06,"PSC 9258, Box 8489\nAPO AA 42991-3352",1.165346e-08
4997,63390.686886,7.250591,4.805081,2.13,33266.145490,1.030730e+06,"4215 Tracy Garden Suite 076\nJoshualand, VA 01...",3.131216e-08
4998,68001.331235,5.534388,7.130144,5.44,42625.620156,1.198657e+06,USS Wallace\nFPO AE 73316,2.966750e-08


2. Adding a new row to the data set

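A new row can be added by assigning to a new index label with .loc. Here we first make a copy with copy() so that the original data frame stays unchanged.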

In [33]:
data_copy = data.copy()
data_copy.loc[5000] = [5004,5003, 5006,4543, 45, 56, 67, 6785]

In [34]:
data_copy

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address,Total
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1.059034e+06,"208 Michael Ferry Apt. 674\nLaurabury, NE 3701...",2.058469e-08
1,79248.642455,6.002900,6.730821,3.09,40173.072174,1.505891e+06,"188 Johnson Views Suite 079\nLake Kathleen, CA...",1.771528e-08
2,61287.067179,5.865890,8.512727,5.13,36882.159400,1.058988e+06,"9127 Elizabeth Stravenue\nDanieltown, WI 06482...",3.288776e-08
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1.260617e+06,USS Barnett\nFPO AP 44820,2.159025e-08
4,59982.197226,5.040555,7.839388,4.23,26354.109472,6.309435e+05,USNS Raymond\nFPO AE 09386,6.620144e-08
...,...,...,...,...,...,...,...,...
4996,78491.275435,6.999135,6.576763,4.02,25616.115489,1.482618e+06,"PSC 9258, Box 8489\nAPO AA 42991-3352",1.165346e-08
4997,63390.686886,7.250591,4.805081,2.13,33266.145490,1.030730e+06,"4215 Tracy Garden Suite 076\nJoshualand, VA 01...",3.131216e-08
4998,68001.331235,5.534388,7.130144,5.44,42625.620156,1.198657e+06,USS Wallace\nFPO AE 73316,2.966750e-08
4999,65510.581804,5.992305,6.792336,4.07,46501.283803,1.298950e+06,"37778 George Ridges Apt. 509\nEast Holly, NV 2...",2.756003e-08
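
The truncated display above does not show the added row. A minimal sketch on a small, made-up data frame makes the effect visible: assigning to a label that does not exist yet appends a row, and the original is left alone because we work on a copy.

```python
import pandas as pd

scores = pd.DataFrame({"subject": ["Math", "Urdu"], "marks": [45, 78]})

scores_copy = scores.copy()                      # leave the original untouched
scores_copy.loc[len(scores_copy)] = ["Art", 78]  # new label 2, so a new row

print(len(scores))        # 2
print(len(scores_copy))   # 3
print(scores_copy.tail(1))
```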


3. Indexing a dataframe using .iloc

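The DataFrame.iloc[] indexer selects rows and columns by integer position (0, 1, 2, ..., n-1), regardless of the index labels. It is useful when the index labels are something other than the default 0, 1, 2, ... or when the user doesn't know the index labels.

We shall work on the data frame 'data', which now includes the Total column.

Select the row at position 2 (the third row, since positions start at 0)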

In [35]:
data.iloc[2]

Avg. Area Income                                                     61287.067179
Avg. Area House Age                                                       5.86589
Avg. Area Number of Rooms                                                8.512727
Avg. Area Number of Bedrooms                                                 5.13
Area Population                                                        36882.1594
Price                                                              1058987.987876
Address                         9127 Elizabeth Stravenue\nDanieltown, WI 06482...
Total                                                                         0.0
Name: 2, dtype: object

Select the rows at positions 4, 7 and 10

In [36]:
data.iloc[[4,7,10]]

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address,Total
4,59982.197226,5.040555,7.839388,4.23,26354.109472,630943.5,USNS Raymond\nFPO AE 09386,6.620144e-08
7,78394.339278,6.98978,6.620478,2.42,36516.358972,1573937.0,"972 Joyce Viaduct\nLake William, TN 17778-6483",1.474053e-08
10,80527.472083,8.093513,5.042747,4.1,47224.35984,1707046.0,"6368 John Motorway Suite 700\nJanetbury, NM 26854",1.6206e-08


We use two square brackets since we are passing a list of row numbers to be accessed.

Select the rows at positions 12 to 16 (the stop position 17 is excluded)

In [37]:
data.iloc[12:17]

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address,Total
12,39033.809237,7.671755,7.250029,3.1,39220.361467,1042814.0,"209 Natasha Stream Suite 961\nHuffmanland, NE ...",3.606599e-08
13,73163.663441,6.919535,5.993188,2.27,32326.123139,1291332.0,"829 Welch Track Apt. 992\nNorth John, AR 26532...",1.938555e-08
14,69391.380184,5.344776,8.406418,4.37,35521.294033,1402818.0,"PSC 5330, Box 4420\nAPO AP 08302",1.805037e-08
15,73091.866746,5.443156,8.517513,4.01,23929.524053,1306675.0,"2278 Shannon View\nNorth Carriemouth, NM 84617",1.401519e-08
16,79706.963058,5.06789,8.219771,3.12,39717.813576,1556787.0,"064 Hayley Unions\nNicholsborough, HI 44161-1887",1.638805e-08


Select the column at position 1 (the second column)

In [38]:
data.iloc[:, 1]

0       5.682861
1       6.002900
2       5.865890
3       7.188236
4       5.040555
          ...   
4995    7.830362
4996    6.999135
4997    7.250591
4998    5.534388
4999    5.992305
Name: Avg. Area House Age, Length: 5000, dtype: float64

Select the last column

In [39]:
data.iloc[:, -1]

0       2.058469e-08
1       1.771528e-08
2       3.288776e-08
3       2.159025e-08
4       6.620144e-08
            ...     
4995    2.031774e-08
4996    1.165346e-08
4997    3.131216e-08
4998    2.966750e-08
4999    2.756003e-08
Name: Total, Length: 5000, dtype: float64

To select the last column we use -1; to select the second-to-last column we use -2.

Select the first two columns

In [40]:
data.iloc[:,0:2]

Unnamed: 0,Avg. Area Income,Avg. Area House Age
0,79545.458574,5.682861
1,79248.642455,6.002900
2,61287.067179,5.865890
3,63345.240046,7.188236
4,59982.197226,5.040555
...,...,...
4995,60567.944140,7.830362
4996,78491.275435,6.999135
4997,63390.686886,7.250591
4998,68001.331235,5.534388


Select the first two columns and the rows at positions 5 to 10

In [41]:
data.iloc[5:11, 0:2]

Unnamed: 0,Avg. Area Income,Avg. Area House Age
5,80175.754159,4.988408
6,64698.463428,6.025336
7,78394.339278,6.98978
8,59927.660813,5.362126
9,81885.927184,4.423672
10,80527.472083,8.093513


4. Indexing a dataframe using .loc

DataFrame.loc[] selects by index label rather than by position: it takes row (and optionally column) labels and returns the matching rows as a Series or DataFrame, provided the labels exist in the data frame.

DataFrame.loc[row_names, column_names] is used to select or index rows or columns based on their name.

Select the rows labeled 1 to 5 and the columns 'Avg. Area Income' and 'Total'

In [42]:
data.loc[1:5,["Avg. Area Income" , "Total"]]

Unnamed: 0,Avg. Area Income,Total
1,79248.642455,1.771528e-08
2,61287.067179,3.288776e-08
3,63345.240046,2.159025e-08
4,59982.197226,6.620144e-08
5,80175.754159,2.344464e-08


In [43]:
data.iloc[1:5,[1,3]]

Unnamed: 0,Avg. Area House Age,Avg. Area Number of Bedrooms
1,6.0029,3.09
2,5.86589,5.13
3,7.188236,3.26
4,5.040555,4.23


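Note: Here the row labels are the default integers, so .loc and .iloc look similar. The difference is that .loc[1:5] includes the label 5, while .iloc[1:5] stops before position 5.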
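
The difference between label-based and position-based selection is easier to see when the row labels are not numbers. The sketch below uses a small, made-up data frame indexed by subject name.

```python
import pandas as pd

marks = pd.DataFrame(
    {"marks": [45, 55, 78, 94], "cgpa": [2.6, 3.8, 3.6, 4.0]},
    index=["Math", "History", "Urdu", "English"],
)

# A label slice includes the end label
by_label = marks.loc["History":"Urdu", ["marks"]]

# A position slice excludes the stop position
by_position = marks.iloc[1:3, [0]]

print(by_label.equals(by_position))   # True: both pick History and Urdu

print(len(marks.loc["Math":"Urdu"]))  # 3 rows
print(len(marks.iloc[0:2]))           # 2 rows
```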

5. Selecting columns by specifying column names

In [44]:
data.Total

0       2.058469e-08
1       1.771528e-08
2       3.288776e-08
3       2.159025e-08
4       6.620144e-08
            ...     
4995    2.031774e-08
4996    1.165346e-08
4997    3.131216e-08
4998    2.966750e-08
4999    2.756003e-08
Name: Total, Length: 5000, dtype: float64

Remark: With attribute access we can select only one column at a time.

In [45]:
# OR
data["Avg. Area Income"]

0       79545.458574
1       79248.642455
2       61287.067179
3       63345.240046
4       59982.197226
            ...     
4995    60567.944140
4996    78491.275435
4997    63390.686886
4998    68001.331235
4999    65510.581804
Name: Avg. Area Income, Length: 5000, dtype: float64

Select the columns 'Avg. Area Income' and 'Total' by passing a list of column names

In [46]:
data[["Avg. Area Income", "Total"]]

Unnamed: 0,Avg. Area Income,Total
0,79545.458574,2.058469e-08
1,79248.642455,1.771528e-08
2,61287.067179,3.288776e-08
3,63345.240046,2.159025e-08
4,59982.197226,6.620144e-08
...,...,...
4995,60567.944140,2.031774e-08
4996,78491.275435,1.165346e-08
4997,63390.686886,3.131216e-08
4998,68001.331235,2.966750e-08


6. Sort the data frame on the basis of values in a column
   


Each column of a pandas DataFrame is a pandas Series. The .sort_values() method of a DataFrame works similarly to the one for a Series, except that we also specify the column to sort by.

In [48]:
# Print head() of 'data'
data.head()

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address,Total
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1059034.0,"208 Michael Ferry Apt. 674\nLaurabury, NE 3701...",2.058469e-08
1,79248.642455,6.0029,6.730821,3.09,40173.072174,1505891.0,"188 Johnson Views Suite 079\nLake Kathleen, CA...",1.771528e-08
2,61287.067179,5.86589,8.512727,5.13,36882.1594,1058988.0,"9127 Elizabeth Stravenue\nDanieltown, WI 06482...",3.288776e-08
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1260617.0,USS Barnett\nFPO AP 44820,2.159025e-08
4,59982.197226,5.040555,7.839388,4.23,26354.109472,630943.5,USNS Raymond\nFPO AE 09386,6.620144e-08


In [50]:
# Sort the data frame on the basis of 'Avg. Area Income' values
# By default the values are sorted in ascending order
data.sort_values('Avg. Area Income')

# Note : 'ascending = False' will sort the data frame in descending order

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address,Total
39,17796.631190,4.949557,6.713905,2.50,47162.183643,3.023558e+05,"9932 Eric Circles\nLake Martha, WY 34611-6127",5.158901e-07
3069,35454.714659,6.855708,6.018647,4.50,59636.402553,1.077806e+06,Unit 4700 Box 1880\nDPO AP 18074,5.133701e-08
2092,35608.986237,6.935839,7.827589,6.35,20833.007623,4.493316e+05,"652 Stanton Island\nAdamsview, VA 56957-9960",1.031854e-07
4855,35797.323122,5.544221,7.795138,5.00,24844.200190,2.998630e+05,"645 Mary Radial\nEast Roberto, CA 23652-5430",2.762989e-07
1459,35963.330809,3.438547,8.264122,3.28,24435.777302,1.430274e+05,"166 Terry Grove\nSouth Michaelhaven, PR 18054",1.194505e-06
...,...,...,...,...,...,...,...,...
2719,101599.670580,7.798746,7.480512,6.39,37523.864670,2.370231e+06,"52280 Steven Street\nRobertchester, IA 40405-0504",6.679225e-09
962,101928.858060,4.829586,9.039382,4.08,22804.991935,1.938866e+06,"856 Harris Centers Suite 940\nNicholasport, IL...",6.066443e-09
3541,102881.120902,6.471249,5.693536,3.12,21051.531294,1.754938e+06,"784 Arnold Prairie Apt. 787\nJamesside, NM 04270",6.835342e-09
1734,104702.724257,5.575523,6.932106,3.22,22560.527135,1.742432e+06,"14230 Douglas River Suite 570\nConniechester, ...",7.430837e-09


7. Rank the dataframe

In [53]:
# Rank the data frame 'data' in descending order based on 'Avg. Area Income'
# method='min' gives tied values the lowest rank in their group

data['Avg. Area Income_rank'] = data['Avg. Area Income'].rank(ascending=False, method='min')
data

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address,Total,Avg. Area Income_rank
0,4254.0,5.682861,7.009188,4.09,23086.800503,1.059034e+06,"208 Michael Ferry Apt. 674\nLaurabury, NE 3701...",2.058469e-08,747.0
1,4224.0,6.002900,6.730821,3.09,40173.072174,1.505891e+06,"188 Johnson Views Suite 079\nLake Kathleen, CA...",1.771528e-08,777.0
2,1222.0,5.865890,8.512727,5.13,36882.159400,1.058988e+06,"9127 Elizabeth Stravenue\nDanieltown, WI 06482...",3.288776e-08,3779.0
3,1573.0,7.188236,5.586729,3.26,34310.242831,1.260617e+06,USS Barnett\nFPO AP 44820,2.159025e-08,3428.0
4,1038.0,5.040555,7.839388,4.23,26354.109472,6.309435e+05,USNS Raymond\nFPO AE 09386,6.620144e-08,3963.0
...,...,...,...,...,...,...,...,...,...
4995,1109.0,7.830362,6.137356,3.46,22837.361035,1.060194e+06,USNS Williams\nFPO AP 30153-7653,2.031774e-08,3892.0
4996,4136.0,6.999135,6.576763,4.02,25616.115489,1.482618e+06,"PSC 9258, Box 8489\nAPO AA 42991-3352",1.165346e-08,865.0
4997,1582.0,7.250591,4.805081,2.13,33266.145490,1.030730e+06,"4215 Tracy Garden Suite 076\nJoshualand, VA 01...",3.131216e-08,3419.0
4998,2364.0,5.534388,7.130144,5.44,42625.620156,1.198657e+06,USS Wallace\nFPO AE 73316,2.966750e-08,2637.0


With method='min', values that are tied all receive the lowest rank in their group, and the next distinct value receives the rank it would have had without the tie. For example, if the three highest values of Avg. Area Income were equal, all three would receive rank 1 and the next value would receive rank 4. Thus, there would be no rank equal to 2 or 3.

In [55]:
# method = 'dense' assigns same rank to all the same Avg. Area Income values
data['Avg. Area Income_densed_rank'] = data['Avg. Area Income'].rank(method= 'dense')
data

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address,Total,Avg. Area Income_rank,Avg. Area Income_densed_rank
0,4254.0,5.682861,7.009188,4.09,23086.800503,1.059034e+06,"208 Michael Ferry Apt. 674\nLaurabury, NE 3701...",2.058469e-08,747.0,4254.0
1,4224.0,6.002900,6.730821,3.09,40173.072174,1.505891e+06,"188 Johnson Views Suite 079\nLake Kathleen, CA...",1.771528e-08,777.0,4224.0
2,1222.0,5.865890,8.512727,5.13,36882.159400,1.058988e+06,"9127 Elizabeth Stravenue\nDanieltown, WI 06482...",3.288776e-08,3779.0,1222.0
3,1573.0,7.188236,5.586729,3.26,34310.242831,1.260617e+06,USS Barnett\nFPO AP 44820,2.159025e-08,3428.0,1573.0
4,1038.0,5.040555,7.839388,4.23,26354.109472,6.309435e+05,USNS Raymond\nFPO AE 09386,6.620144e-08,3963.0,1038.0
...,...,...,...,...,...,...,...,...,...,...
4995,1109.0,7.830362,6.137356,3.46,22837.361035,1.060194e+06,USNS Williams\nFPO AP 30153-7653,2.031774e-08,3892.0,1109.0
4996,4136.0,6.999135,6.576763,4.02,25616.115489,1.482618e+06,"PSC 9258, Box 8489\nAPO AA 42991-3352",1.165346e-08,865.0,4136.0
4997,1582.0,7.250591,4.805081,2.13,33266.145490,1.030730e+06,"4215 Tracy Garden Suite 076\nJoshualand, VA 01...",3.131216e-08,3419.0,1582.0
4998,2364.0,5.534388,7.130144,5.44,42625.620156,1.198657e+06,USS Wallace\nFPO AE 73316,2.966750e-08,2637.0,2364.0


The dense method also assigns the same rank to tied values, but the rank always increases by exactly 1 from one distinct value to the next. Since ascending order is the default here, rank 1 goes to the minimum value of Avg. Area Income, rank 2 to the next larger value, and so on. Thus, no rank is skipped in the dense method.
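
The two ranking methods are easiest to compare on a small Series with ties. The values below are made up for illustration.

```python
import pandas as pd

incomes = pd.Series([500, 300, 500, 500, 200])

# 'min': the three tied values share rank 1, the next value gets rank 4
min_rank = incomes.rank(ascending=False, method="min")

# 'dense': the tied values share rank 1, the next value gets rank 2
dense_rank = incomes.rank(ascending=False, method="dense")

print(pd.DataFrame({"income": incomes, "min": min_rank, "dense": dense_rank}))
```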

8. To check for missing values

We shall read the data set again into a new data frame.

In [56]:
missing_data = pd.read_csv(r'C:/Users\Rao Hammad Raza/OneDrive/Pictures/Python ka chilla vs/USA_Housing (1).csv')
missing_data

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1.059034e+06,"208 Michael Ferry Apt. 674\nLaurabury, NE 3701..."
1,79248.642455,6.002900,6.730821,3.09,40173.072174,1.505891e+06,"188 Johnson Views Suite 079\nLake Kathleen, CA..."
2,61287.067179,5.865890,8.512727,5.13,36882.159400,1.058988e+06,"9127 Elizabeth Stravenue\nDanieltown, WI 06482..."
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1.260617e+06,USS Barnett\nFPO AP 44820
4,59982.197226,5.040555,7.839388,4.23,26354.109472,6.309435e+05,USNS Raymond\nFPO AE 09386
...,...,...,...,...,...,...,...
4995,60567.944140,7.830362,6.137356,3.46,22837.361035,1.060194e+06,USNS Williams\nFPO AP 30153-7653
4996,78491.275435,6.999135,6.576763,4.02,25616.115489,1.482618e+06,"PSC 9258, Box 8489\nAPO AA 42991-3352"
4997,63390.686886,7.250591,4.805081,2.13,33266.145490,1.030730e+06,"4215 Tracy Garden Suite 076\nJoshualand, VA 01..."
4998,68001.331235,5.534388,7.130144,5.44,42625.620156,1.198657e+06,USS Wallace\nFPO AE 73316


In [58]:
# Remove Address feature
missing_data.drop('Address', axis = 1 , inplace= True)

In [59]:
# Remove rows with missing data
missing_data.dropna(inplace= True)

In [60]:
missing_data

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1.059034e+06
1,79248.642455,6.002900,6.730821,3.09,40173.072174,1.505891e+06
2,61287.067179,5.865890,8.512727,5.13,36882.159400,1.058988e+06
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1.260617e+06
4,59982.197226,5.040555,7.839388,4.23,26354.109472,6.309435e+05
...,...,...,...,...,...,...
4995,60567.944140,7.830362,6.137356,3.46,22837.361035,1.060194e+06
4996,78491.275435,6.999135,6.576763,4.02,25616.115489,1.482618e+06
4997,63390.686886,7.250591,4.805081,2.13,33266.145490,1.030730e+06
4998,68001.331235,5.534388,7.130144,5.44,42625.620156,1.198657e+06


This data has no missing values, so dropna() did not remove any rows.

In [61]:
missing_data.isna()

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
...,...,...,...,...,...,...
4995,False,False,False,False,False,False
4996,False,False,False,False,False,False
4997,False,False,False,False,False,False
4998,False,False,False,False,False,False
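
Since the housing data has no missing values, the effect of `isna()` and `dropna()` is easier to see on a small data frame that does. The sketch below uses made-up values; summing the boolean frame gives the number of missing values per column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [100.0, np.nan, 250.0], "rooms": [3, 4, np.nan]})

missing_per_column = df.isna().sum()  # True counts as 1
print(missing_per_column)

cleaned = df.dropna()                 # keep only rows without any NaN
print(cleaned)
```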


# 6. Reading Data from Different Sources

1. Read a .xlsx file
   

In [None]:
pd.read_excel('example.xlsx')

2. Read a .zip file
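
One possible way to read a zipped CSV file, sketched under the assumption that the archive contains exactly one CSV file. The file names are hypothetical, and the archive is created in a temporary folder so the example can run on its own.

```python
import os
import tempfile
import zipfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "example.zip")

    # Build a small archive holding a single CSV file
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("example.csv", "subject,marks\nMath,45\nUrdu,78\n")

    # The compression can be inferred from the .zip extension;
    # it is stated explicitly here for clarity
    zipped = pd.read_csv(zip_path, compression="zip")

print(zipped)
```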

3. Read a .txt file

In [None]:
data = pd.read_csv('example.txt', sep="\t")
data.head()
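
Pandas has no dedicated function for .txt files; a delimited text file is read with `pd.read_csv` and a suitable `sep`. A self-contained sketch, using an in-memory string in place of a real tab-separated file:

```python
import io

import pandas as pd

# Stand-in for the contents of a tab-separated text file
text = "subject\tmarks\nMath\t45\nUrdu\t78\n"

table = pd.read_csv(io.StringIO(text), sep="\t")
print(table.head())
```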