In [1]:
import pandas as pd

# Skip malformed rows; on_bad_lines replaces error_bad_lines, which was removed in pandas 2.0
movies = pd.read_csv('movies.csv', on_bad_lines='skip')

### Selecting Rows and Columns

The .loc indexer selects rows by label and the .iloc indexer selects them by integer position. To look up rows by film name, make the Film column the index with set_index.

In [3]:
movies.set_index('Film', inplace=True)

In [5]:
display(movies.head())

Film,Genre,Lead Studio,Audience score %,Profitability,Rotten Tomatoes %,Worldwide Gross,Year
Zack and Miri Make a Porno,Romance,The Weinstein Company,70,1.747542,64,$41.94,2008
Youth in Revolt,Comedy,The Weinstein Company,52,1.09,68,$19.62,2010
You Will Meet a Tall Dark Stranger,Comedy,Independent,35,1.211818,43,$26.66,2010
When in Rome,Comedy,Disney,44,0.0,15,$43.04,2010
What Happens in Vegas,Comedy,Fox,72,6.267647,28,$219.37,2008


To select the third row of the movies DataFrame, pass 2 to the .iloc indexer (positions start at 0).

In [8]:
display(movies.iloc[2])

Genre                     Comedy
Lead Studio          Independent
Audience score %              35
Profitability            1.21182
Rotten Tomatoes %             43
Worldwide Gross          $26.66 
Year                        2010
Name: You Will Meet a Tall Dark Stranger, dtype: object

To select the same row by label, pass the film name to the .loc indexer.

In [9]:
display(movies.loc['You Will Meet a Tall Dark Stranger'])

Genre                     Comedy
Lead Studio          Independent
Audience score %              35
Profitability            1.21182
Rotten Tomatoes %             43
Worldwide Gross          $26.66 
Year                        2010
Name: You Will Meet a Tall Dark Stranger, dtype: object

To select several rows at arbitrary positions, pass a list of positions to the .iloc indexer.

In [10]:
display(movies.iloc[[1, 3, 7]])

Film,Genre,Lead Studio,Audience score %,Profitability,Rotten Tomatoes %,Worldwide Gross,Year
Youth in Revolt,Comedy,The Weinstein Company,52,1.09,68,$19.62,2010
When in Rome,Comedy,Disney,44,0.0,15,$43.04,2010
Waitress,Romance,Independent,67,11.089741,89,$22.18,2007
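
The same list-style selection works with .loc and labels, and slices behave differently between the two indexers. The sketch below uses a small hypothetical frame (the film names and values are placeholders, not rows from movies.csv) to show that a .loc slice includes its end label while an .iloc slice excludes its end position.

```python
import pandas as pd

# Hypothetical stand-in for the movies data; names and values are illustrative
toy = pd.DataFrame(
    {"Genre": ["Comedy", "Comedy", "Romance", "Drama"],
     "Year": [2010, 2010, 2007, 2009]},
    index=["Film A", "Film B", "Film C", "Film D"],
)

# A list of labels with .loc mirrors a list of positions with .iloc
by_label = toy.loc[["Film B", "Film D"]]
by_position = toy.iloc[[1, 3]]

# Slices differ: .loc includes the end label, .iloc excludes the end position
label_slice = toy.loc["Film A":"Film C"]   # Film A, Film B, Film C
position_slice = toy.iloc[0:2]             # Film A, Film B only
```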


### Selecting Rows and Columns simultaneously

In [11]:
movies.iloc[:, [3, 4, 5]].head()

Film,Profitability,Rotten Tomatoes %,Worldwide Gross
Zack and Miri Make a Porno,1.747542,64,$41.94
Youth in Revolt,1.09,68,$19.62
You Will Meet a Tall Dark Stranger,1.211818,43,$26.66
When in Rome,0.0,15,$43.04
What Happens in Vegas,6.267647,28,$219.37
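
The cell above keeps every row and picks columns by position. As a rough sketch of restricting rows and columns at once, the example below passes a row selector and a column selector to each indexer. The frame is a hypothetical stand-in with placeholder film names, not the actual movies data.

```python
import pandas as pd

# Hypothetical stand-in for the movies data; names and values are illustrative
toy = pd.DataFrame(
    {"Genre": ["Comedy", "Comedy", "Romance"],
     "Profitability": [1.09, 0.0, 11.09],
     "Year": [2010, 2010, 2007]},
    index=["Film A", "Film B", "Film C"],
)

# Row labels first, column labels second
subset_by_label = toy.loc[["Film A", "Film C"], ["Profitability", "Year"]]

# The same selection expressed with integer positions
subset_by_position = toy.iloc[[0, 2], [1, 2]]

# A single row label and a single column label return one scalar value
single_value = toy.loc["Film C", "Profitability"]
```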


### Selecting a single column

In [13]:
movies['Profitability']

Film
Zack and Miri Make a Porno            1.747542
Youth in Revolt                       1.090000
You Will Meet a Tall Dark Stranger    1.211818
When in Rome                          0.000000
What Happens in Vegas                 6.267647
                                        ...   
Across the Universe                   0.652603
A Serious Man                         4.382857
A Dangerous Method                    0.448645
27 Dresses                            5.343622
(500) Days of Summer                  8.096000
Name: Profitability, Length: 77, dtype: float64

### Selecting multiple columns

In [14]:
movies[['Profitability', 'Worldwide Gross']].head()

Film,Profitability,Worldwide Gross
Zack and Miri Make a Porno,1.747542,$41.94
Youth in Revolt,1.09,$19.62
You Will Meet a Tall Dark Stranger,1.211818,$26.66
When in Rome,0.0,$43.04
What Happens in Vegas,6.267647,$219.37


### Selecting a column by position

In [15]:
movies.iloc[:, 2]

Film
Zack and Miri Make a Porno            70
Youth in Revolt                       52
You Will Meet a Tall Dark Stranger    35
When in Rome                          44
What Happens in Vegas                 72
                                      ..
Across the Universe                   84
A Serious Man                         64
A Dangerous Method                    89
27 Dresses                            71
(500) Days of Summer                  81
Name: Audience score %, Length: 77, dtype: int64

### Selecting a range of columns by position

In [16]:
movies.iloc[:, :2].head()

Film,Genre,Lead Studio
Zack and Miri Make a Porno,Romance,The Weinstein Company
Youth in Revolt,Comedy,The Weinstein Company
You Will Meet a Tall Dark Stranger,Comedy,Independent
When in Rome,Comedy,Disney
What Happens in Vegas,Comedy,Fox


### String Operations

Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods skip missing/NA values automatically. They are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods.
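
To make the NA handling concrete, here is a minimal sketch on a made-up Series with one missing entry (the values are illustrative, not taken from movies.csv). The missing entry passes through each str method as missing instead of raising an error.

```python
import numpy as np
import pandas as pd

# Made-up genre values with one missing entry
genres = pd.Series(["Romance", np.nan, "Comedy"])

# The missing entry stays missing; the other elements are transformed
upper = genres.str.upper()
lengths = genres.str.len()

print(upper)
print(lengths)
```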

In [18]:
movies['Genre'] = movies['Genre'].str.upper()
display(movies['Genre'])

Film
Zack and Miri Make a Porno            ROMANCE
Youth in Revolt                        COMEDY
You Will Meet a Tall Dark Stranger     COMEDY
When in Rome                           COMEDY
What Happens in Vegas                  COMEDY
                                       ...   
Across the Universe                   ROMANCE
A Serious Man                           DRAMA
A Dangerous Method                      DRAMA
27 Dresses                             COMEDY
(500) Days of Summer                   COMEDY
Name: Genre, Length: 77, dtype: object

In [19]:
movies['Genre'].str.len()

Film
Zack and Miri Make a Porno            7
Youth in Revolt                       6
You Will Meet a Tall Dark Stranger    6
When in Rome                          6
What Happens in Vegas                 6
                                     ..
Across the Universe                   7
A Serious Man                         5
A Dangerous Method                    5
27 Dresses                            6
(500) Days of Summer                  6
Name: Genre, Length: 77, dtype: int64
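
The same accessor can help with columns such as Worldwide Gross, which is displayed above as text like '$26.66 ' rather than as a number. The following is a sketch on a small hand-made Series, assuming every value has the form of a dollar sign, a decimal number, and optional surrounding whitespace. Real data may need extra handling for missing or differently formatted entries.

```python
import pandas as pd

# Hand-made values in the same shape as the Worldwide Gross column
gross = pd.Series(["$41.94 ", "$19.62 ", "$219.37 "], name="Worldwide Gross")

gross_numeric = (
    gross.str.strip()                        # drop surrounding whitespace
         .str.replace("$", "", regex=False)  # remove the literal dollar sign
         .astype(float)                      # convert the cleaned text to floats
)
print(gross_numeric)
```

Passing regex=False matters here because `$` is a special character in regular expressions.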

The string methods on Index are especially useful for cleaning up or transforming DataFrame columns. For instance, you may have columns with leading or trailing whitespace.

In [20]:
import numpy as np
df = pd.DataFrame(np.random.randn(3, 2),columns=[' Column A ', ' Column B '], index=range(3))
df.columns.str.strip()

Index(['Column A', 'Column B'], dtype='object')

These string methods can then be chained to clean up the column names as needed. Here we remove leading and trailing whitespace, lowercase all names, and replace any remaining spaces with underscores.

In [21]:
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

In [22]:
display(df)

Unnamed: 0,column_a,column_b
0,-0.712127,-0.286342
1,-0.119727,-1.189909
2,0.228575,-0.702177
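
The literal replace above handles single spaces. When names may contain tabs or runs of several spaces, a regular expression is one option. The sketch below uses made-up column names and passes regex=True so that each run of whitespace collapses into one underscore.

```python
import pandas as pd

# Made-up column names with irregular whitespace (illustrative only)
messy = pd.DataFrame(columns=["  Column   A ", "Column\tB"])

messy.columns = (
    messy.columns.str.strip()
                 .str.lower()
                 .str.replace(r"\s+", "_", regex=True)  # any whitespace run -> one underscore
)
print(list(messy.columns))
```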


### Finding Unique Values

Series.unique returns the unique values, in order of first appearance, as a NumPy array. For a Series backed by an extension array, it instead returns a new ExtensionArray of that type containing just the unique values. This includes:

* Categorical 
* Period 
* Datetime with Timezone 
* Interval 
* Sparse 
* IntegerNA 

In [23]:
pd.Series([2, 1, 3, 3], name='A').unique()

array([2, 1, 3])

In [24]:
pd.Series([pd.Timestamp('2016-01-01') for _ in range(3)]).unique()

array(['2016-01-01T00:00:00.000000000'], dtype='datetime64[ns]')

In [25]:
pd.Series([pd.Timestamp('2016-01-01', tz='US/Eastern')
           for _ in range(3)]).unique()

<DatetimeArray>
['2016-01-01 00:00:00-05:00']
Length: 1, dtype: datetime64[ns, US/Eastern]
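
Categorical data follows the same extension-array rule. As a small sketch on made-up letters, unique on a categorical Series returns a Categorical rather than a NumPy array, with values in order of first appearance.

```python
import pandas as pd

# Made-up categorical values
cat_series = pd.Series(pd.Categorical(list("baabc")))

cat_unique = cat_series.unique()     # a Categorical, not a NumPy array
n_unique = cat_series.nunique()      # number of distinct values

print(type(cat_unique).__name__, list(cat_unique), n_unique)
```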

### Drop Duplicates

drop_duplicates returns a DataFrame with duplicate rows removed, which makes removing duplicate records simple.

In [26]:
print('length of data before removing duplicates', len(movies))
movies = movies.drop_duplicates()
print('length of data after removing duplicates', len(movies))

length of data before removing duplicates 77
length of data after removing duplicates 74
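
Before dropping rows, it can help to see which ones pandas treats as duplicates. The sketch below uses a small made-up frame and the duplicated method, which returns a boolean mask. By default only the later repeats are flagged, and keep=False flags every member of a duplicated group.

```python
import pandas as pd

# Made-up records; the last row repeats the first
people = pd.DataFrame(
    {"Name": ["Jack", "Ali", "Phil", "Jack"],
     "Age": [21, 28, 40, 21]}
)

later_repeats = people.duplicated()            # flags only the repeated copy
all_copies = people.duplicated(keep=False)     # flags every row in a duplicated group

# Inspect the affected rows, then drop the repeats
print(people[all_copies])
deduped = people.drop_duplicates()
```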


In [27]:
data = {"Name": ["Jack", "Ali", "Phil", "Jack"],
"Age": [21, 28, 40, 21],
"Sex": ["Male", "Female", "Male", "Male"]}
df = pd.DataFrame(data)
display(df)

Unnamed: 0,Name,Age,Sex
0,Jack,21,Male
1,Ali,28,Female
2,Phil,40,Male
3,Jack,21,Male


In [28]:
df = df.drop_duplicates()
display(df)

Unnamed: 0,Name,Age,Sex
0,Jack,21,Male
1,Ali,28,Female
2,Phil,40,Male


drop_duplicates can also compare only a subset of columns. In our example data, this would be useful if we had two entries for Name = Jack, one with Age = 21 and one with Age = 25. If we want only the oldest entry for each person, we can sort by Age in descending order and drop duplicates on the Name column, keeping the first (highest-age) observation for each name.

In [29]:
data = {"Name": ["Jack", "Ali", "Phil", "Jack"],
"Age": [21, 28, 40, 25],
"Sex": ["Male", "Female", "Male", "Male"]}
df = pd.DataFrame(data)
display(df)

Unnamed: 0,Name,Age,Sex
0,Jack,21,Male
1,Ali,28,Female
2,Phil,40,Male
3,Jack,25,Male


In [30]:
df = df.sort_values('Age', ascending=False)
df = df.drop_duplicates(subset='Name', keep='first')
display(df)

Unnamed: 0,Name,Age,Sex
2,Phil,40,Male
1,Ali,28,Female
3,Jack,25,Male
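
An alternative that avoids sorting the whole frame is to group by name and look up the row with the largest age in each group. This is a sketch of a different technique from the sort-then-drop approach above, using groupby and idxmax on the same example data. The resulting rows are ordered by name rather than by age.

```python
import pandas as pd

people = pd.DataFrame(
    {"Name": ["Jack", "Ali", "Phil", "Jack"],
     "Age": [21, 28, 40, 25],
     "Sex": ["Male", "Female", "Male", "Male"]}
)

# Index label of the highest-age row within each name group
oldest_idx = people.groupby("Name")["Age"].idxmax()

# Pull those rows out of the original frame
oldest = people.loc[oldest_idx]
print(oldest)
```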
