# Pandas Basics



The table has one row for each album and the following columns:

- **artist** - Name of the artist
- **album** - Name of the album
- **released_year** - Year the album was released
- **length_min_sec** - Length of the album (hours,minutes,seconds)
- **genre** - Genre of the album
- **music_recording_sales_millions** - Music recording sales (millions in USD) on [SONG://DATABASE](http://www.song-database.com/)
- **claimed_sales_millions** - Album's claimed sales (millions in USD) on [SONG://DATABASE](http://www.song-database.com/)
- **date_released** - Date on which the album was released
- **soundtrack** - Indicates whether the album is a movie soundtrack (Y) or not (N)
- **rating_of_friends** - Indicates the rating from your friends from 1 to 10
<br>

You can see the dataset here:

<font size="1">
<table style="width:85%; font-size:xx-small">
  <tr>
    <th>Artist</th>
    <th>Album</th> 
    <th>Released</th>
    <th>Length</th>
    <th>Genre</th> 
    <th>Music recording sales (millions)</th>
    <th>Claimed sales (millions)</th>
    <th>Date released</th>
    <th>Soundtrack</th>
    <th>Rating (friends)</th>
  </tr>
  <tr>
    <td>Michael Jackson</td>
    <td>Thriller</td> 
    <td>1982</td>
    <td>00:42:19</td>
    <td>Pop, rock, R&B</td>
    <td>46</td>
    <td>65</td>
    <td>30-Nov-82</td>
    <td></td>
    <td>10.0</td>
  </tr>
  <tr>
    <td>AC/DC</td>
    <td>Back in Black</td> 
    <td>1980</td>
    <td>00:42:11</td>
    <td>Hard rock</td>
    <td>26.1</td>
    <td>50</td>
    <td>25-Jul-80</td>
    <td></td>
    <td>8.5</td>
  </tr>
    <tr>
    <td>Pink Floyd</td>
    <td>The Dark Side of the Moon</td> 
    <td>1973</td>
    <td>00:42:49</td>
    <td>Progressive rock</td>
    <td>24.2</td>
    <td>45</td>
    <td>01-Mar-73</td>
    <td></td>
    <td>9.5</td>
  </tr>
    <tr>
    <td>Whitney Houston</td>
    <td>The Bodyguard</td> 
    <td>1992</td>
    <td>00:57:44</td>
    <td>Soundtrack/R&B, soul, pop</td>
    <td>27.4</td>
    <td>44</td>
    <td>17-Nov-92</td>
    <td>Y</td>
    <td>7.0</td>
  </tr>
    <tr>
    <td>Meat Loaf</td>
    <td>Bat Out of Hell</td> 
    <td>1977</td>
    <td>00:46:33</td>
    <td>Hard rock, progressive rock</td>
    <td>20.6</td>
    <td>43</td>
    <td>21-Oct-77</td>
    <td></td>
    <td>7.0</td>
  </tr>
    <tr>
    <td>Eagles</td>
    <td>Their Greatest Hits (1971-1975)</td> 
    <td>1976</td>
    <td>00:43:08</td>
    <td>Rock, soft rock, folk rock</td>
    <td>32.2</td>
    <td>42</td>
    <td>17-Feb-76</td>
    <td></td>
    <td>9.5</td>
  </tr>
    <tr>
    <td>Bee Gees</td>
    <td>Saturday Night Fever</td> 
    <td>1977</td>
    <td>1:15:54</td>
    <td>Disco</td>
    <td>20.6</td>
    <td>40</td>
    <td>15-Nov-77</td>
    <td>Y</td>
    <td>9.0</td>
  </tr>
    <tr>
    <td>Fleetwood Mac</td>
    <td>Rumours</td> 
    <td>1977</td>
    <td>00:40:01</td>
    <td>Soft rock</td>
    <td>27.9</td>
    <td>40</td>
    <td>04-Feb-77</td>
    <td></td>
    <td>9.5</td>
  </tr>
</table></font>

<a id="ref1"></a>
<h2 align=center> Importing Data </h2>

We can import a library (dependency) such as Pandas using the following command:

In [2]:
import pandas as pd

After the import command, we have access to a large number of pre-built classes and functions. This assumes the library is installed; in our lab environment all the necessary libraries are installed. One way Pandas allows you to work with data is the dataframe. Let's go through the process of going from a comma-separated values (**.csv**) file to a dataframe. The path of the **.csv** file is passed as an argument to the **read_csv** function. The result is stored in the object **df**, a common short name for a variable that refers to a Pandas dataframe.

In [None]:
df=pd.read_csv('top_selling_albums.csv')

In [6]:
df= pd.read_csv('top_selling_albums.csv')
df

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Michael Jackson,Thriller,1982,0:42:19,"pop, rock, R&B",46.0,65,30-Nov-82,,10.0
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,25-Jul-80,,9.5
2,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,01-Mar-73,,9.0
3,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,17-Nov-92,Y,8.5
4,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,21-Oct-77,,8.0
5,Eagles,Their Greatest Hits (1971-1975),1976,0:43:08,"rock, soft rock, folk rock",32.2,42,17-Feb-76,,7.5
6,Bee Gees,Saturday Night Fever,1977,1:15:54,disco,20.6,40,15-Nov-77,Y,7.0
7,Fleetwood Mac,Rumours,1977,0:40:01,soft rock,27.9,40,04-Feb-77,,6.5
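
If you want to try **read_csv** without the lab file, the following minimal sketch reads a small table from text held in memory. The two-row extract and the names `csv_text` and `sample` are made up for illustration; they are not part of the lab files.

```python
import io
import pandas as pd

# Hypothetical two-row extract of the albums file, held in memory
csv_text = """Artist,Album,Released
Michael Jackson,Thriller,1982
AC/DC,Back in Black,1980
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))   # the result is a Pandas dataframe
print(sample.shape)   # (number of rows, number of columns)
```

Passing a path such as 'top_selling_albums.csv' works the same way, provided the file is in the current working directory.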


In [7]:
type(df)

pandas.core.frame.DataFrame

We can use the method **head()** to examine the first five rows of a dataframe: 

In [8]:
df.head()

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Michael Jackson,Thriller,1982,0:42:19,"pop, rock, R&B",46.0,65,30-Nov-82,,10.0
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,25-Jul-80,,9.5
2,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,01-Mar-73,,9.0
3,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,17-Nov-92,Y,8.5
4,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,21-Oct-77,,8.0


We can use the method **tail()** to examine the last five rows of a dataframe: 

In [9]:
df.tail()

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
3,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,17-Nov-92,Y,8.5
4,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,21-Oct-77,,8.0
5,Eagles,Their Greatest Hits (1971-1975),1976,0:43:08,"rock, soft rock, folk rock",32.2,42,17-Feb-76,,7.5
6,Bee Gees,Saturday Night Fever,1977,1:15:54,disco,20.6,40,15-Nov-77,Y,7.0
7,Fleetwood Mac,Rumours,1977,0:40:01,soft rock,27.9,40,04-Feb-77,,6.5


In [10]:
df.head(6)

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Michael Jackson,Thriller,1982,0:42:19,"pop, rock, R&B",46.0,65,30-Nov-82,,10.0
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,25-Jul-80,,9.5
2,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,01-Mar-73,,9.0
3,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,17-Nov-92,Y,8.5
4,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,21-Oct-77,,8.0
5,Eagles,Their Greatest Hits (1971-1975),1976,0:43:08,"rock, soft rock, folk rock",32.2,42,17-Feb-76,,7.5


We can use the attribute **shape** to examine the number of rows and columns of a dataframe: 

In [11]:
df.shape

(8, 10)

In [16]:
df.columns

Index(['Artist', 'Album', 'Released', 'Length', 'Genre',
       'Music Recording Sales (millions)', 'Claimed Sales (millions)',
       'Released.1', 'Soundtrack', 'Rating'],
      dtype='object')
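
As a quick recap of the inspection tools above, here is a small sketch on a made-up dataframe (the name `demo` and its columns are assumptions for illustration only). It shows that **head()** and **tail()** take an optional row count, while **shape** and **columns** are attributes and need no parentheses.

```python
import pandas as pd

# Hypothetical dataframe with 8 rows, standing in for the albums table
demo = pd.DataFrame({"n": range(8), "square": [i * i for i in range(8)]})

first_five = demo.head()      # the default is 5 rows
first_three = demo.head(3)    # pass a count to change it
last_two = demo.tail(2)

rows, cols = demo.shape       # shape is an attribute, so no parentheses
print(rows, cols)
print(list(demo.columns))
```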

The process for loading an Excel file is similar: we pass the path of the Excel file to the function **read_excel**. The result is a dataframe, as before:

In [None]:
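# Dependency needed to read Excel files; uncomment the next line to install it if it is missing from the environment
# (recent pandas versions read .xlsx files with openpyxl; xlrd 2.0 and later supports only .xls)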
#!pip install xlrd

In [19]:
import os

os.getcwd()

'C:\\Users\\fayaz_000\\Desktop\\AML for AI\\Pandas Basics'

In [20]:
os.listdir()

['.ipynb_checkpoints',
 'cars.csv',
 'Pandas Basics-Copy1.ipynb',
 'Pandas Basics.ipynb',
 'top_selling_albums.csv',
 'top_selling_albums.xlsx',
 'Untitled.ipynb']

In [21]:
df = pd.read_excel('C:\\Users\\fayaz_000\\Desktop\\AML for AI\\Pandas Basics\\top_selling_albums.xlsx')
df.head()

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Michael Jackson,Thriller,1982,00:42:19,"pop, rock, R&B",46.0,65,1982-11-30,,10.0
1,AC/DC,Back in Black,1980,00:42:11,hard rock,26.1,50,1980-07-25,,9.5
2,Pink Floyd,The Dark Side of the Moon,1973,00:42:49,progressive rock,24.2,45,1973-03-01,,9.0
3,Whitney Houston,The Bodyguard,1992,00:57:44,"R&B, soul, pop",27.4,44,1992-11-17,Y,8.5
4,Meat Loaf,Bat Out of Hell,1977,00:46:33,"hard rock, progressive rock",20.6,43,1977-10-21,,8.0


We can access the column 'Length' and assign it to a new dataframe 'x':

In [24]:
x=df[['Length']]
print (type(x))
x

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Length
0,00:42:19
1,00:42:11
2,00:42:49
3,00:57:44
4,00:46:33
5,00:43:08
6,01:15:54
7,00:40:01


 The process is shown in the figure: 

<img src="https://ibm.box.com/shared/static/bz800py5ui4w0kpb0k09lq3k5oegop5v.png" width="750" align="center">

 <a id="ref2"></a>
<h2 align=center> Viewing Data and Accessing Data </h2>

You can also assign the column to a series; you can think of a Pandas series as a 1-D dataframe. Just use one pair of brackets:

In [25]:
x=df['Length']
type(x)

pandas.core.series.Series

In [26]:
x

0    00:42:19
1    00:42:11
2    00:42:49
3    00:57:44
4    00:46:33
5    00:43:08
6    01:15:54
7    00:40:01
Name: Length, dtype: object
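
The difference between one and two pairs of brackets can be checked directly. This is a minimal sketch on a made-up two-row dataframe (`demo` is an assumption, not the lab data):

```python
import pandas as pd

demo = pd.DataFrame({"Artist": ["AC/DC", "Eagles"],
                     "Length": ["0:42:11", "0:43:08"]})

as_series = demo["Length"]     # one pair of brackets: a Series
as_frame = demo[["Length"]]    # two pairs of brackets: a dataframe with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

A series has a one-element shape, while the one-column dataframe keeps both a row and a column dimension.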

You can select other columns in the same way, for example, the columns 'Released.1' and 'Artist':

In [27]:
newvar=df[['Released.1']]
newvar

Unnamed: 0,Released.1
0,1982-11-30
1,1980-07-25
2,1973-03-01
3,1992-11-17
4,1977-10-21
5,1976-02-17
6,1977-11-15
7,1977-02-04


In [28]:
x=df[['Artist']]
type(x)

pandas.core.frame.DataFrame

#### Assign the variable 'q' to the dataframe that is made up of the column 'Rating':


In [29]:
#write code here
q=df[['Rating']]
q

Unnamed: 0,Rating
0,10.0
1,9.5
2,9.0
3,8.5
4,8.0
5,7.5
6,7.0
7,6.5


You can do the same thing for multiple columns; we just put the dataframe name, in this case, **df**, and the name of the multiple column headers enclosed in double brackets. The result is a new dataframe comprised of the specified columns:

In [30]:
y=df[['Length','Artist','Genre']]
y

Unnamed: 0,Length,Artist,Genre
0,00:42:19,Michael Jackson,"pop, rock, R&B"
1,00:42:11,AC/DC,hard rock
2,00:42:49,Pink Floyd,progressive rock
3,00:57:44,Whitney Houston,"R&B, soul, pop"
4,00:46:33,Meat Loaf,"hard rock, progressive rock"
5,00:43:08,Eagles,"rock, soft rock, folk rock"
6,01:15:54,Bee Gees,disco
7,00:40:01,Fleetwood Mac,soft rock


The same selection works when the column names are stored in a list, as the following cells and the figure show:

In [31]:
A=['Length', 'Artist', 'Genre']
A

['Length', 'Artist', 'Genre']

In [33]:
p=df[A]
print (type(p))
p

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Length,Artist,Genre
0,00:42:19,Michael Jackson,"pop, rock, R&B"
1,00:42:11,AC/DC,hard rock
2,00:42:49,Pink Floyd,progressive rock
3,00:57:44,Whitney Houston,"R&B, soul, pop"
4,00:46:33,Meat Loaf,"hard rock, progressive rock"
5,00:43:08,Eagles,"rock, soft rock, folk rock"
6,01:15:54,Bee Gees,disco
7,00:40:01,Fleetwood Mac,soft rock


<img src="https://ibm.box.com/shared/static/dh9duk3ucuhmmmbixa6ugac6g384m5sq.png" width="1100" align="center">

In [None]:
df[['Album','Released','Length']]

#### Assign the variable 'q' to the dataframe that is made up of the column 'Released' and 'Artist':

In [None]:
q=df[['Released','Artist']]
q

One way to access individual elements is the 'iloc' indexer, which selects by integer position. You can access the first row and first column as follows:

In [36]:
df.head(2)

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Michael Jackson,Thriller,1982,00:42:19,"pop, rock, R&B",46.0,65,1982-11-30,,10.0
1,AC/DC,Back in Black,1980,00:42:11,hard rock,26.1,50,1980-07-25,,9.5


In [46]:
df.iloc[0,0]

'Michael Jackson'

In [61]:
df.iloc[0,0:1], df.iloc[0:1,0]

(Artist    Michael Jackson
 Name: 0, dtype: object,
 0    Michael Jackson
 Name: Artist, dtype: object)

Notice something? When only one of the two positions is a slice, a series is returned.

Slicing both the rows and the columns returns a dataframe:

In [51]:
a=df.iloc[0:2,0:6]
print (type(a))
a

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions)
0,Michael Jackson,Thriller,1982,00:42:19,"pop, rock, R&B",46.0
1,AC/DC,Back in Black,1980,00:42:11,hard rock,26.1


You can access the first two rows as follows:

In [52]:
df.iloc[0:2]

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Michael Jackson,Thriller,1982,00:42:19,"pop, rock, R&B",46.0,65,1982-11-30,,10.0
1,AC/DC,Back in Black,1980,00:42:11,hard rock,26.1,50,1980-07-25,,9.5


In [56]:
df.iloc[1,:]

Artist                                            AC/DC
Album                                     Back in Black
Released                                           1980
Length                                         00:42:11
Genre                                         hard rock
Music Recording Sales (millions)                   26.1
Claimed Sales (millions)                             50
Released.1                          1980-07-25 00:00:00
Soundtrack                                          NaN
Rating                                              9.5
Name: 1, dtype: object

In [60]:
df.iloc[0:1,0]

0    Michael Jackson
Name: Artist, dtype: object

You can access columns with rows as well:

In [62]:
df.iloc[0:2,0:2]

Unnamed: 0,Artist,Album
0,Michael Jackson,Thriller
1,AC/DC,Back in Black


Another indexer, 'loc', selects by the labels of the rows and columns, whereas 'iloc' uses their integer positions. Note that a 'loc' slice includes its end label, so `0:2` returns the rows labelled 0, 1 and 2:

In [63]:
df.loc[0:2,['Album']]

Unnamed: 0,Album
0,Thriller
1,Back in Black
2,The Dark Side of the Moon


In [64]:
df.loc[0:2,'Album']

0                     Thriller
1                Back in Black
2    The Dark Side of the Moon
Name: Album, dtype: object

Access multiple columns by 'loc'

In [None]:
df.loc[0:2,['Album','Length']] 
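
With the default integer index, labels and positions look the same, which hides the difference between the two indexers. The sketch below uses a made-up dataframe with string row labels (the labels `a` to `d` are an assumption for illustration) to make the contrast visible:

```python
import pandas as pd

demo = pd.DataFrame(
    {"Album": ["Thriller", "Back in Black", "Rumours", "Bat Out of Hell"],
     "Released": [1982, 1980, 1977, 1977]},
    index=["a", "b", "c", "d"],   # hypothetical string labels
)

by_position = demo.iloc[0:2, 0]         # positions 0 and 1; the end is excluded
by_label = demo.loc["a":"c", "Album"]   # labels a through c; the end is included

print(list(by_position))
print(list(by_label))
```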

#### Access the 5th to 7th rows (index positions 4 to 6):

In [65]:
df.iloc[4:7]

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
4,Meat Loaf,Bat Out of Hell,1977,00:46:33,"hard rock, progressive rock",20.6,43,1977-10-21,,8.0
5,Eagles,Their Greatest Hits (1971-1975),1976,00:43:08,"rock, soft rock, folk rock",32.2,42,1976-02-17,,7.5
6,Bee Gees,Saturday Night Fever,1977,01:15:54,disco,20.6,40,1977-11-15,Y,7.0


With 'loc', the same rows are selected by `df.loc[4:6]` (the end label 6 is included).

#### Access the rows from index 5 onward, with the columns from Artist to Length:

In [66]:
df.loc[5:,['Artist','Album','Released','Length']]

Unnamed: 0,Artist,Album,Released,Length
5,Eagles,Their Greatest Hits (1971-1975),1976,00:43:08
6,Bee Gees,Saturday Night Fever,1977,01:15:54
7,Fleetwood Mac,Rumours,1977,00:40:01


Using iloc

In [68]:
df.iloc[5:,0:4]

Unnamed: 0,Artist,Album,Released,Length
5,Eagles,Their Greatest Hits (1971-1975),1976,00:43:08
6,Bee Gees,Saturday Night Fever,1977,01:15:54
7,Fleetwood Mac,Rumours,1977,00:40:01


## Adding a Column

In [75]:
df['Song Name']=[1,2,3,4,5,6,7,8]

In [85]:
df['New Artist'] = df['Artist']

In [86]:
df.head()

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating,Song Name,New Artist
0,Michael Jackson,Thriller,1982,00:42:19,"pop, rock, R&B",46.0,65,1982-11-30,,10.0,1,Michael Jackson
1,AC/DC,Back in Black,1980,00:42:11,hard rock,26.1,50,1980-07-25,,9.5,2,AC/DC
2,Pink Floyd,The Dark Side of the Moon,1973,00:42:49,progressive rock,24.2,45,1973-03-01,,9.0,3,Pink Floyd
3,Whitney Houston,The Bodyguard,1992,00:57:44,"R&B, soul, pop",27.4,44,1992-11-17,Y,8.5,4,Whitney Houston
4,Meat Loaf,Bat Out of Hell,1977,00:46:33,"hard rock, progressive rock",20.6,43,1977-10-21,,8.0,5,Meat Loaf


## Dropping a Column

In [87]:
df.drop(['New Artist','Song Name'], axis=1, inplace=True)

In [88]:
df.head()

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Michael Jackson,Thriller,1982,00:42:19,"pop, rock, R&B",46.0,65,1982-11-30,,10.0
1,AC/DC,Back in Black,1980,00:42:11,hard rock,26.1,50,1980-07-25,,9.5
2,Pink Floyd,The Dark Side of the Moon,1973,00:42:49,progressive rock,24.2,45,1973-03-01,,9.0
3,Whitney Houston,The Bodyguard,1992,00:57:44,"R&B, soul, pop",27.4,44,1992-11-17,Y,8.5
4,Meat Loaf,Bat Out of Hell,1977,00:46:33,"hard rock, progressive rock",20.6,43,1977-10-21,,8.0
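
The cells above add columns from a list and from an existing column, then drop them in place. As a complementary sketch on made-up data (`demo` and the column "Claimed (thousands)" are assumptions), a new column can also be computed from an existing one, and **drop** without `inplace=True` returns a new dataframe instead of modifying the original:

```python
import pandas as pd

demo = pd.DataFrame({"Album": ["Thriller", "Rumours"], "Claimed": [65, 40]})

# A new column can be computed from an existing one
demo["Claimed (thousands)"] = demo["Claimed"] * 1000

# Without inplace=True, drop returns a new dataframe and leaves demo unchanged
trimmed = demo.drop(columns=["Claimed (thousands)"])

print(list(demo.columns))
print(list(trimmed.columns))
```

Here `columns=[...]` is equivalent to passing the names with `axis=1`.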


## Data Type of Each Column

In [89]:
df.dtypes

Artist                                      object
Album                                       object
Released                                     int64
Length                                      object
Genre                                       object
Music Recording Sales (millions)           float64
Claimed Sales (millions)                     int64
Released.1                          datetime64[ns]
Soundtrack                                  object
Rating                                     float64
dtype: object
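
The 'Length' column is stored as text, so it cannot be used in arithmetic as it stands. One possible way to handle this, sketched here on made-up rows and not part of the original lab, is to parse the strings with `pd.to_timedelta` and derive a numeric column (the column name "Minutes" is an assumption):

```python
import pandas as pd

demo = pd.DataFrame({"Album": ["Thriller", "Saturday Night Fever"],
                     "Length": ["00:42:19", "01:15:54"]})
print(demo.dtypes)   # Length holds strings at this point

# Parse the HH:MM:SS strings into durations, then express them in minutes
demo["Length"] = pd.to_timedelta(demo["Length"])
demo["Minutes"] = demo["Length"].dt.total_seconds() / 60

print(demo.dtypes)
print(demo["Minutes"].round(1).tolist())
```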

## Checking for Null Values in a Dataframe

In [90]:
df.isnull()

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,True,False
5,False,False,False,False,False,False,False,False,True,False
6,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,True,False


In [91]:
df.isnull().sum()

Artist                              0
Album                               0
Released                            0
Length                              0
Genre                               0
Music Recording Sales (millions)    0
Claimed Sales (millions)            0
Released.1                          0
Soundtrack                          6
Rating                              0
dtype: int64
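
The count works because **isnull()** returns True or False for each cell, and True counts as 1 when summed. The sketch below repeats this on a made-up three-row dataframe and then fills the gaps; replacing a missing soundtrack flag with "N" is an assumption for illustration, not a step from the lab:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"Album": ["Thriller", "The Bodyguard", "Rumours"],
                     "Soundtrack": [np.nan, "Y", np.nan]})

missing_per_column = demo.isnull().sum()   # True counts as 1 when summed
print(missing_per_column)

# Hypothetical clean-up: treat a missing flag as "N"
filled = demo.fillna({"Soundtrack": "N"})
print(filled["Soundtrack"].tolist())
```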

## Summary Statistics

In [92]:
df.describe()

Unnamed: 0,Released,Music Recording Sales (millions),Claimed Sales (millions),Rating
count,8.0,8.0,8.0,8.0
mean,1979.25,28.125,46.125,8.25
std,5.800246,8.189322,8.271077,1.224745
min,1973.0,20.6,40.0,6.5
25%,1976.75,23.3,41.5,7.375
50%,1977.0,26.75,43.5,8.25
75%,1980.5,28.975,46.25,9.125
max,1992.0,46.0,65.0,10.0


In [93]:
df.describe(include='all')


Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
count,8,8,8.0,8,8,8.0,8.0,8,2,8.0
unique,8,8,,8,8,,,8,1,
top,Whitney Houston,Bat Out of Hell,,00:42:49,soft rock,,,1982-11-30 00:00:00,Y,
freq,1,1,,1,1,,,1,2,
first,,,,,,,,1973-03-01 00:00:00,,
last,,,,,,,,1992-11-17 00:00:00,,
mean,,,1979.25,,,28.125,46.125,,,8.25
std,,,5.800246,,,8.189322,8.271077,,,1.224745
min,,,1973.0,,,20.6,40.0,,,6.5
25%,,,1976.75,,,23.3,41.5,,,7.375


## Querying a dataframe

Querying a dataframe means finding the rows that satisfy certain conditions. For example, you may want to find the albums with a rating greater than or equal to 9:

In [94]:
top_rated = df[df['Rating'] >= 9.0]


In [95]:
top_rated

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Michael Jackson,Thriller,1982,00:42:19,"pop, rock, R&B",46.0,65,1982-11-30,,10.0
1,AC/DC,Back in Black,1980,00:42:11,hard rock,26.1,50,1980-07-25,,9.5
2,Pink Floyd,The Dark Side of the Moon,1973,00:42:49,progressive rock,24.2,45,1973-03-01,,9.0


Notice that in the above result all the columns are displayed. If you want to access a specific column, then you can use **loc** for that purpose.

In [101]:
s=df[df['Rating']>=9.0]
s

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Michael Jackson,Thriller,1982,00:42:19,"pop, rock, R&B",46.0,65,1982-11-30,,10.0
1,AC/DC,Back in Black,1980,00:42:11,hard rock,26.1,50,1980-07-25,,9.5
2,Pink Floyd,The Dark Side of the Moon,1973,00:42:49,progressive rock,24.2,45,1973-03-01,,9.0


In [102]:
top_rated_albums = df.loc[df['Rating'] >= 9.0, ['Album']]

In [103]:
top_rated_albums

Unnamed: 0,Album
0,Thriller
1,Back in Black
2,The Dark Side of the Moon
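
Under the hood, the condition produces a series of True/False values (a boolean mask) that picks the rows to keep. The following sketch, on made-up data, prints such a mask and then combines two conditions; the combined query is an illustration and not one of the lab tasks:

```python
import pandas as pd

demo = pd.DataFrame({"Album": ["Thriller", "Back in Black", "Rumours"],
                     "Released": [1982, 1980, 1977],
                     "Rating": [10.0, 9.5, 6.5]})

mask = demo["Rating"] >= 9.0   # one True/False value per row
print(mask.tolist())

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
recent_and_top = demo.loc[(demo["Rating"] >= 9.0) & (demo["Released"] > 1980), "Album"]
print(recent_and_top.tolist())
```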


### Task

Find the albums released in or after the year 1980:

In [109]:
# write code here
released=df.loc[df['Released']>=1980,['Album']]
released

Unnamed: 0,Album
0,Thriller
1,Back in Black
3,The Bodyguard


# Exercise

Import the file **cars.csv** and show the following:

    1.) Top five rows
    2.) Last five rows
    3.) No. of rows and columns
    4.) Access 100th to 130th row with any 3 columns
    5.) Object types of each column
    6.) Null values count in each column
    7.) Summary Statistics

In [115]:
#importing the file
df_cars=pd.read_csv('cars.csv')
df_cars.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,122.0,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495
1,3,122.0,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500
2,1,122.0,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450


In [116]:
# top 5 rows
df_cars.head()


Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,122.0,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495
1,3,122.0,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500
2,1,122.0,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450


In [120]:
# rows 22 to 26; for the last 5 rows of the full 201-row dataframe, use df_cars.tail()
df_cars[22:27]


Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
22,1,118.0,dodge,gas,turbo,two,hatchback,fwd,front,93.7,...,98,mpfi,3.03,3.39,7.6,102.0,5500.0,24,30,7957
23,1,148.0,dodge,gas,std,four,hatchback,fwd,front,93.7,...,90,2bbl,2.97,3.23,9.4,68.0,5500.0,31,38,6229
24,1,148.0,dodge,gas,std,four,sedan,fwd,front,93.7,...,90,2bbl,2.97,3.23,9.4,68.0,5500.0,31,38,6692
25,1,148.0,dodge,gas,std,four,sedan,fwd,front,93.7,...,90,2bbl,2.97,3.23,9.4,68.0,5500.0,31,38,7609
26,1,148.0,dodge,gas,turbo,four,sedan,fwd,front,93.7,...,98,mpfi,3.03,3.39,7.6,102.0,5500.0,24,30,8558


In [126]:
# Access the rows at index 100 to 130 with the first 3 columns (only the first 10 rows of the output are shown)
df_cars.iloc[100:131,0:3]


Unnamed: 0,symboling,normalized-losses,make
100,0,108.0,nissan
101,3,194.0,nissan
102,3,194.0,nissan
103,1,231.0,nissan
104,0,161.0,peugot
105,0,161.0,peugot
106,0,122.0,peugot
107,0,122.0,peugot
108,0,161.0,peugot
109,0,161.0,peugot


In [129]:
# object types of each column
df_cars.dtypes

symboling              int64
normalized-losses    float64
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower           float64
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                  int64
dtype: object

In [130]:
#Null values count in each column
df_cars.isnull().sum()

symboling            0
normalized-losses    0
make                 0
fuel-type            0
aspiration           0
num-of-doors         0
body-style           0
drive-wheels         0
engine-location      0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
engine-type          0
num-of-cylinders     0
engine-size          0
fuel-system          0
bore                 0
stroke               0
compression-ratio    0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64

In [131]:
#Summary statistics
df_cars.describe()

Unnamed: 0,symboling,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
count,201.0,201.0,201.0,201.0,201.0,201.0,201.0,201.0,201.0,201.0,201.0,201.0,201.0,201.0,201.0,201.0
mean,0.840796,122.0,98.797015,174.200995,65.889055,53.766667,2555.666667,126.875622,3.330692,3.256874,10.164279,103.405534,5117.665368,25.179104,30.686567,13207.129353
std,1.254802,31.99625,6.066366,12.322175,2.101471,2.447822,517.296727,41.546834,0.268072,0.316048,4.004965,37.3657,478.113805,6.42322,6.81515,7947.066342
min,-2.0,65.0,86.6,141.1,60.3,47.8,1488.0,61.0,2.54,2.07,7.0,48.0,4150.0,13.0,16.0,5118.0
25%,0.0,101.0,94.5,166.8,64.1,52.0,2169.0,98.0,3.15,3.11,8.6,70.0,4800.0,19.0,25.0,7775.0
50%,1.0,122.0,97.0,173.2,65.5,54.1,2414.0,120.0,3.31,3.29,9.0,95.0,5125.369458,24.0,30.0,10295.0
75%,2.0,137.0,102.4,183.5,66.6,55.5,2926.0,141.0,3.58,3.41,9.4,116.0,5500.0,30.0,34.0,16500.0
max,3.0,256.0,120.9,208.1,72.0,59.8,4066.0,326.0,3.94,4.17,23.0,262.0,6600.0,49.0,54.0,45400.0


# Grouping

Grouping works much like "GROUP BY" in databases. Pandas provides the **groupby** method, which serves the same purpose:

In [141]:
fuel_price=df_cars.groupby('fuel-type')['price'].mean()

In [135]:
fuel_price

fuel-type
diesel    15838.15000
gas       12916.40884
Name: price, dtype: float64

In [136]:
door_price=df_cars.groupby('num-of-doors')['price'].mean()

In [137]:
door_price

num-of-doors
four    13498.034783
two     12818.127907
Name: price, dtype: float64

In [138]:
group=df_cars.groupby(['fuel-type','num-of-doors'])['price'].mean().reset_index()

In [139]:
group

Unnamed: 0,fuel-type,num-of-doors,price
0,diesel,four,16100.764706
1,diesel,two,14350.0
2,gas,four,13046.540816
3,gas,two,12762.759036
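
To see what the grouping does on a table small enough to check by hand, here is a minimal sketch with four made-up cars (the prices are invented for illustration). It also shows `as_index=False`, which keeps the group keys as ordinary columns and so has the same effect as calling **reset_index()** afterwards:

```python
import pandas as pd

cars = pd.DataFrame({"fuel-type": ["gas", "gas", "diesel", "diesel"],
                     "num-of-doors": ["two", "four", "four", "four"],
                     "price": [10000, 14000, 16000, 18000]})

# One mean price per fuel type
mean_by_fuel = cars.groupby("fuel-type")["price"].mean()
print(mean_by_fuel)

# Group by two keys and keep them as regular columns
summary = cars.groupby(["fuel-type", "num-of-doors"], as_index=False)["price"].mean()
print(summary)
```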


### Task

Find the total price according to each body style.

Hint: use <code>sum()</code> instead of <code>mean()</code>.

In [None]:
# write the code here



# Pivoting and Melting

**Pivoting**

In [None]:
pivoted= pd.pivot_table(group,index='fuel-type',columns='num-of-doors', values='price',aggfunc='mean',fill_value=0)

In [None]:
pivoted

In [None]:
new= pivoted.reset_index()

In [None]:
new

**Melting**

In [None]:
melted= pd.melt(new,id_vars='fuel-type', value_name='Mean Price')

In [None]:
melted
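
The cells above depend on the `group` dataframe built from **cars.csv**. For a self-contained look at the round trip, the sketch below uses a made-up `group` table with invented prices: pivoting turns the door counts into columns (wide format), and melting turns them back into rows (long format).

```python
import pandas as pd

# Hypothetical stand-in for the grouped result, with invented prices
group = pd.DataFrame({"fuel-type": ["diesel", "diesel", "gas", "gas"],
                      "num-of-doors": ["four", "two", "four", "two"],
                      "price": [16000.0, 14000.0, 13000.0, 12000.0]})

# Wide format: one row per fuel type, one column per door count
wide = pd.pivot_table(group, index="fuel-type", columns="num-of-doors",
                      values="price", aggfunc="mean")
print(wide)

# Long format again: one row per (fuel type, door count) pair
long = pd.melt(wide.reset_index(), id_vars="fuel-type", value_name="Mean Price")
print(long)
```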