# Dropping unnecessary columns in a DataFrame

__[Datasets can be found here](https://github.com/realpython/python-data-cleaning/tree/master/Datasets)__

In [1]:
import pandas as pd
import numpy as np

First, let’s create a DataFrame out of the CSV file ‘BL-Flickr-Images-Book.csv’. In the examples below, we pass an absolute path to pd.read_csv, meaning that all of the datasets are expected in the /resources/data folder; adjust the path to wherever you saved them:

In [2]:
df = pd.read_csv('/resources/data/BL-Flickr-Images-Book.csv')
df.head()

Unnamed: 0,Identifier,Edition Statement,Place of Publication,Date of Publication,Publisher,Title,Author,Contributors,Corporate Author,Corporate Contributors,Former owner,Engraver,Issuance type,Flickr URL,Shelfmarks
0,206,,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,FORBES Walter.,,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12641.b.30.
1,216,,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,A. A. A.,BLAZE DE BURY Marie Pauline Rose - Baroness,,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12626.cc.2.
2,218,,London,1869,Bradbury Evans & Co.,Love the Avenger. By the author of “All for Gr...,A. A. A.,BLAZE DE BURY Marie Pauline Rose - Baroness,,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12625.dd.1.
3,472,,London,1851,James Darling,Welsh Sketches chiefly ecclesiastical to the...,A. E. S.,Appleyard Ernest Silvanus.,,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 10369.bbb.15.
4,480,A new edition revised etc.,London,1857,Wertheim & Macintosh,[The World in which I live and my place in it...,A. E. S.,BROOME John Henry.,,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 9007.d.28.


When we look at the first five entries using the head() method, we can see that a handful of columns provide ancillary information that would be helpful to the library but isn’t very descriptive of the books themselves: Edition Statement, Corporate Author, Corporate Contributors, Former owner, Engraver, Contributors, Issuance type and Shelfmarks.

We can define a list that contains the names of all the columns we want to drop. Next, we call the drop() method on our DataFrame, passing in the inplace parameter as True and the axis parameter as 1. This tells Pandas that we want the changes to be made directly in our object and that it should look for the labels to be dropped among the columns of the object.

In [3]:
to_drop = ['Edition Statement',
           'Corporate Author',
           'Corporate Contributors',
           'Former owner',
           'Engraver',
           'Contributors',
           'Issuance type',
           'Shelfmarks']

df.drop(to_drop, inplace = True, axis = 1)
df.head()

Unnamed: 0,Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
0,206,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
1,216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,A. A. A.,http://www.flickr.com/photos/britishlibrary/ta...
2,218,London,1869,Bradbury Evans & Co.,Love the Avenger. By the author of “All for Gr...,A. A. A.,http://www.flickr.com/photos/britishlibrary/ta...
3,472,London,1851,James Darling,Welsh Sketches chiefly ecclesiastical to the...,A. E. S.,http://www.flickr.com/photos/britishlibrary/ta...
4,480,London,1857,Wertheim & Macintosh,[The World in which I live and my place in it...,A. E. S.,http://www.flickr.com/photos/britishlibrary/ta...


In [4]:
# Alternative options

# Alternatively, we could also remove the columns by passing them to the columns parameter directly 
# instead of separately specifying the labels to be removed and the axis where Pandas should look for 
# the labels:

# df.drop(columns=to_drop, inplace=True)

# If you know in advance which columns you’d like to retain, another option is to pass them to the 
# usecols argument of pd.read_csv.
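
The two alternatives mentioned in the cell above can be tried on a tiny in-memory CSV. This is only a sketch: the three-column `csv_text` and the names `books`, `dropped` and `kept` are made up for illustration and are not part of the original dataset.

```python
import io

import pandas as pd

# Hypothetical three-column CSV standing in for the real dataset.
csv_text = (
    "Identifier,Publisher,Shelfmarks\n"
    "206,S. Tinsley & Co.,HMNTS 12641.b.30.\n"
    "216,Virtue & Co.,HMNTS 12626.cc.2.\n"
)

books = pd.read_csv(io.StringIO(csv_text))

# Option 1: drop by name through the columns parameter.
# Without inplace=True, drop() returns a new DataFrame and leaves books untouched.
dropped = books.drop(columns=["Shelfmarks"])

# Option 2: never load the unwanted column in the first place.
kept = pd.read_csv(io.StringIO(csv_text), usecols=["Identifier", "Publisher"])

print(dropped.columns.tolist())
print(kept.columns.tolist())
```

Both routes should end with the same two columns; usecols has the added benefit of not reading the discarded data into memory at all.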

# Changing the Index of a DataFrame

In many cases it is helpful to use a uniquely valued identifying field of the data as its index. For example, in the dataset used in the previous section, a librarian searching for a record would likely enter the book’s unique identifier (a value in the Identifier column).

Unlike primary keys in SQL, a Pandas Index doesn’t make any guarantee of being unique, although many indexing and merging operations will run faster if it is. So far, our index has been a RangeIndex: integers starting from 0, analogous to Python’s built-in range. Below, we first check that Identifier is unique and then pass the column name to set_index, which replaces the index with the values in Identifier.

In [5]:
df['Identifier'].is_unique

True

In [6]:
df.set_index('Identifier', inplace = True)
df.head()

Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
206,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,A. A. A.,http://www.flickr.com/photos/britishlibrary/ta...
218,London,1869,Bradbury Evans & Co.,Love the Avenger. By the author of “All for Gr...,A. A. A.,http://www.flickr.com/photos/britishlibrary/ta...
472,London,1851,James Darling,Welsh Sketches chiefly ecclesiastical to the...,A. E. S.,http://www.flickr.com/photos/britishlibrary/ta...
480,London,1857,Wertheim & Macintosh,[The World in which I live and my place in it...,A. E. S.,http://www.flickr.com/photos/britishlibrary/ta...


In [7]:
# We can access each record in a straightforward way with loc[]. Although the name loc[] is not
# especially intuitive, it does label-based indexing: it selects a row or record by its label,
# without regard to its position:

df.loc[206]

# In other words, 206 is the first label of the index. To access it by position,
# we could use df.iloc[0], which does position-based indexing.

# df.iloc[0]


Place of Publication                                               London
Date of Publication                                           1879 [1878]
Publisher                                                S. Tinsley & Co.
Title                                   Walter Forbes. [A novel.] By A. A
Author                                                              A. A.
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 206, dtype: object
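
The difference between label-based and position-based access can be checked on a small frame. The sketch below uses a made-up three-row `books` DataFrame; `verify_integrity=True` is an optional set_index argument, not something the original notebook uses, and it makes Pandas raise an error if the new index contains duplicates.

```python
import pandas as pd

# Hypothetical miniature version of the books data.
books = pd.DataFrame(
    {
        "Identifier": [206, 216, 218],
        "Publisher": ["S. Tinsley & Co.", "Virtue & Co.", "Bradbury Evans & Co."],
    }
)

# verify_integrity=True raises ValueError if Identifier has duplicate values.
indexed = books.set_index("Identifier", verify_integrity=True)

by_label = indexed.loc[206]    # label-based: the row whose index label is 206
by_position = indexed.iloc[0]  # position-based: the first row, whatever its label

print(by_label.equals(by_position))

# 0 is no longer a label once Identifier is the index, so loc[0] fails.
try:
    indexed.loc[0]
except KeyError:
    print("0 is not a label in the index")
```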

# Tidying up Fields in the Data

In this section, we will clean specific columns and get them to a uniform format to get a better understanding of the dataset and enforce consistency. In particular, we will be cleaning Date of Publication and Place of Publication.
One field where it makes sense to enforce a numeric value is the date of publication so that we can do calculations down the road:

In [8]:
df['Date of Publication'].head(25)


Identifier
206            1879 [1878]
216                   1868
218                   1869
472                   1851
480                   1857
481                   1875
519                   1872
667                    NaN
874                   1676
1143                  1679
1280                  1802
1808                  1859
1905                  1888
1929           1839, 38-54
2836                  1897
2854                  1865
2956               1860-63
2957                  1873
3017                  1866
3131                  1899
4598                  1814
4884                  1820
4976                  1800
5382    1847, 48 [1846-48]
5385               [1897?]
Name: Date of Publication, dtype: object

A particular book can have only one date of publication. Therefore, we need to do the following:

- Remove the extra dates in square brackets, wherever present: 1879 [1878]

- Convert date ranges to their “start date”, wherever present: 1860-63; 1839, 38-54

- Completely remove the dates we are not certain about and replace them with NumPy’s NaN: [1897?]

- Convert the string nan to NumPy’s NaN value

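The regular expression used below, r'^(\d{4})', is meant to find any four digits at the beginning of a string, which suffices for our case. It is written as a raw string (meaning that a backslash is no longer an escape character), which is standard practice with regular expressions. The \d represents any digit, and {4} repeats this rule four times. The ^ character matches the start of a string, and the parentheses denote a capturing group, which tells Pandas which part of the match to extract. Strings that do not start with four digits, such as [1897?], produce no match and become NaN.

Note the use of df['Date of Publication'].str. This attribute is a way to access speedy string operations in Pandas that largely mimic operations on native Python strings or compiled regular expressions, such as .split(), .replace(), and .capitalize().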

In [9]:
extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
extr.head()

Identifier
206    1879
216    1868
218    1869
472    1851
480    1857
Name: Date of Publication, dtype: object

Technically, this column still has object dtype, but we can easily get its numerical version with pd.to_numeric:


In [10]:
df['Date of Publication'] = pd.to_numeric(extr)
df['Date of Publication'].dtype

dtype('float64')

In [11]:
df['Date of Publication']

Identifier
206        1879.0
216        1868.0
218        1869.0
472        1851.0
480        1857.0
            ...  
4158088    1838.0
4158128    1831.0
4159563       NaN
4159587    1834.0
4160339    1834.0
Name: Date of Publication, Length: 8287, dtype: float64
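
To see how the regular expression treats each of the problem cases listed earlier, it can be run on a handful of values copied from the column. This is a minimal sketch: the Series `raw_dates` is built by hand, and the final conversion to the nullable `Int64` dtype is an optional extra that the original notebook does not perform.

```python
import pandas as pd

# Values copied from the output above, plus one missing entry.
raw_dates = pd.Series(
    ["1879 [1878]", "1860-63", "1839, 38-54", "[1897?]", None],
    name="Date of Publication",
    dtype=object,
)

# Keep only four digits found at the very start of each string.
years = raw_dates.str.extract(r"^(\d{4})", expand=False)

# Strings without a match (and missing values) come through as NaN,
# which is why the numeric result is a float column.
numeric_years = pd.to_numeric(years)
print(numeric_years)

# Optional: a nullable integer dtype keeps whole years and still allows missing values.
nullable_years = numeric_years.astype("Int64")
print(nullable_years)
```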

# Combining str Methods with NumPy to Clean Columns

In [12]:
df['Place of Publication'].head(10)

Identifier
206                                  London
216                London; Virtue & Yorston
218                                  London
472                                  London
480                                  London
481                                  London
519                                  London
667     pp. 40. G. Bryan & Co: Oxford  1898
874                                 London]
1143                                 London
Name: Place of Publication, dtype: object

We see that for some rows, the place of publication is surrounded by other unnecessary information. If we were to look at more values, we would see that this is the case for only some rows that have their place of publication as ‘London’ or ‘Oxford’.

In [13]:
# Let’s take a look at two specific entries:
df.loc[4157862]

Place of Publication                                  Newcastle-upon-Tyne
Date of Publication                                                  1867
Publisher                                                      T. Fordyce
Title                   Local Records; or  Historical Register of rema...
Author                      FORDYCE  T. - Printer  of Newcastle-upon-Tyne
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 4157862, dtype: object

In [14]:
df.loc[4159587]

Place of Publication                                  Newcastle upon Tyne
Date of Publication                                                  1834
Publisher                                                Mackenzie & Dent
Title                   An historical  topographical and descriptive v...
Author                                              Mackenzie  E. (Eneas)
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 4159587, dtype: object

These two books were published in the same place, but one has hyphens in the name of the place while the other does not. To clean this column in one sweep, we can use str.contains() to get a boolean mask.

In [15]:
pub = df['Place of Publication']
london = pub.str.contains('London')
london[:5]

Identifier
206    True
216    True
218    True
472    True
480    True
Name: Place of Publication, dtype: bool

In [16]:
oxford = pub.str.contains('Oxford')

We combine them with np.where. Here, the np.where function is called in a nested structure, with each condition being a Series of booleans obtained with str.contains(). Rows that match London become ‘London’, rows that match Oxford become ‘Oxford’, and every other value keeps its text with hyphens replaced by spaces. The contains() method works similarly to the built-in in keyword used to find the occurrence of an entity in an iterable (or substring in a string).

In [17]:
df['Place of Publication'] = np.where(london, 'London',
                                      np.where(oxford, 'Oxford',
                                               pub.str.replace('-', ' ')))
df['Place of Publication'].head(25)

Identifier
206          London
216          London
218          London
472          London
480          London
481          London
519          London
667          Oxford
874          London
1143         London
1280       Coventry
1808    Christiania
1905        Firenze
1929      Amsterdam
2836         Savona
2854         London
2956          Paris
2957          Paris
3017    Puerto Rico
3131       New York
4598           Hull
4884         London
4976         Oxonii
5382         London
5385         London
Name: Place of Publication, dtype: object

In [18]:
# The hyphens in 'Newcastle-upon-Tyne' have been replaced with spaces
df.loc[4157862]


Place of Publication                                  Newcastle upon Tyne
Date of Publication                                                  1867
Publisher                                                      T. Fordyce
Title                   Local Records; or  Historical Register of rema...
Author                      FORDYCE  T. - Printer  of Newcastle-upon-Tyne
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 4157862, dtype: object
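
The nested np.where call is easier to follow on a few values. The sketch below uses a hand-built Series `places` with entries taken from the outputs above. Passing `na=False` to str.contains and `regex=False` to str.replace are optional choices made here: the first treats missing values as non-matches, and the second makes the hyphen replacement a plain substring replacement.

```python
import numpy as np
import pandas as pd

# A few representative values from the Place of Publication column.
places = pd.Series(
    [
        "London; Virtue & Yorston",
        "pp. 40. G. Bryan & Co: Oxford  1898",
        "Newcastle-upon-Tyne",
        "Paris",
    ],
    dtype=object,
)

# Boolean masks; na=False treats missing values as "does not contain".
is_london = places.str.contains("London", na=False)
is_oxford = places.str.contains("Oxford", na=False)

# Read it as: if London, elif Oxford, else the original text with hyphens replaced.
cleaned = np.where(
    is_london,
    "London",
    np.where(is_oxford, "Oxford", places.str.replace("-", " ", regex=False)),
)

# np.where returns a NumPy array, so wrap it to get the index back.
cleaned = pd.Series(cleaned, index=places.index)
print(cleaned.tolist())
```

With more than two or three conditions, nesting becomes hard to read; np.select, which takes a list of conditions and a list of choices, is one alternative worth considering.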

In [19]:
# Let’s have a look at the first ten entries, which look a lot crisper than when we started out:

df.head(10)

Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
206,London,1879.0,S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
216,London,1868.0,Virtue & Co.,All for Greed. [A novel. The dedication signed...,A. A. A.,http://www.flickr.com/photos/britishlibrary/ta...
218,London,1869.0,Bradbury Evans & Co.,Love the Avenger. By the author of “All for Gr...,A. A. A.,http://www.flickr.com/photos/britishlibrary/ta...
472,London,1851.0,James Darling,Welsh Sketches chiefly ecclesiastical to the...,A. E. S.,http://www.flickr.com/photos/britishlibrary/ta...
480,London,1857.0,Wertheim & Macintosh,[The World in which I live and my place in it...,A. E. S.,http://www.flickr.com/photos/britishlibrary/ta...
481,London,1875.0,William Macintosh,[The World in which I live and my place in it...,A. E. S.,http://www.flickr.com/photos/britishlibrary/ta...
519,London,1872.0,The Author,Lagonells. By the author of Darmayne (F. E. A....,A. F. E.,http://www.flickr.com/photos/britishlibrary/ta...
667,Oxford,,,The Coming of Spring and other poems. By J. A...,A. J.|A. J.,http://www.flickr.com/photos/britishlibrary/ta...
874,London,1676.0,,A Warning to the inhabitants of England and L...,Remaʿ.,http://www.flickr.com/photos/britishlibrary/ta...
1143,London,1679.0,,A Satyr against Vertue. (A poem: supposed to b...,A. T.,http://www.flickr.com/photos/britishlibrary/ta...


# Cleaning the Entire Dataset Using the applymap Function

In certain situations, you will see that the “dirt” is not localized to one column but is more spread out. There are some instances where it would be helpful to apply a customized function to each cell or element of a DataFrame. The Pandas .applymap() method is similar to the built-in map() function and simply applies a function to every element in a DataFrame. (In pandas 2.1 and later, applymap() is deprecated in favor of the equivalent DataFrame.map().)

In [20]:
with open('/resources/data/university_towns.txt', "r") as file1:
    # Read and print only the first 1000 characters to get a feel for the layout
    FileContent = file1.read(1000)
    print(FileContent)

Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]
Arkadelphia (Henderson State University, Ouachita Baptist University)[2]
Conway (Central Baptist College, Hendrix College, University of Central Arkansas)[2]
Fayetteville (University of Arkansas)[7]
Jonesboro (Arkansas State University)[8]
Magnolia (Southern Arkansas University)[2]
Monticello (University of Arkansas at Monticello)[2]
Russellville (Arkansas Tech University)[2]
Searcy (Harding University)[5]
California[edit]
Angwin (Pacific Union College

We see that we have periodic state names followed by the university towns in that state: StateA TownA1 TownA2 StateB TownB1 TownB2.... If we look at the way state names are written in the file, we’ll see that all of them have the “[edit]” substring in them. We can take advantage of this pattern by creating a list of (state, city) tuples and wrapping that list in a DataFrame:

In [21]:
university_towns = []

with open('/resources/data/university_towns.txt') as file:
    for line in file:
        if '[edit]' in line:
            # Remember this `state` until the next is found
            state = line
        else:
            # Otherwise, we have a city; keep `state` as last-seen
            university_towns.append((state, line))

university_towns[:25]

[('Alabama[edit]\n', 'Auburn (Auburn University)[1]\n'),
 ('Alabama[edit]\n', 'Florence (University of North Alabama)\n'),
 ('Alabama[edit]\n', 'Jacksonville (Jacksonville State University)[2]\n'),
 ('Alabama[edit]\n', 'Livingston (University of West Alabama)[2]\n'),
 ('Alabama[edit]\n', 'Montevallo (University of Montevallo)[2]\n'),
 ('Alabama[edit]\n', 'Troy (Troy University)[2]\n'),
 ('Alabama[edit]\n',
  'Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]\n'),
 ('Alabama[edit]\n', 'Tuskegee (Tuskegee University)[5]\n'),
 ('Alaska[edit]\n', 'Fairbanks (University of Alaska Fairbanks)[2]\n'),
 ('Arizona[edit]\n', 'Flagstaff (Northern Arizona University)[6]\n'),
 ('Arizona[edit]\n', 'Tempe (Arizona State University)\n'),
 ('Arizona[edit]\n', 'Tucson (University of Arizona)\n'),
 ('Arkansas[edit]\n',
  'Arkadelphia (Henderson State University, Ouachita Baptist University)[2]\n'),
 ('Arkansas[edit]\n',
  'Conway (Central Baptist College, Hendrix College, Universit

We can wrap this list in a DataFrame and set the columns as “State” and “RegionName”. Pandas will take each element in the list and set State to the left value and RegionName to the right value.

In [22]:
towns_df = pd.DataFrame(university_towns, columns=['State', 'RegionName'])

In [23]:
towns_df.head(25)

Unnamed: 0,State,RegionName
0,Alabama[edit]\n,Auburn (Auburn University)[1]\n
1,Alabama[edit]\n,Florence (University of North Alabama)\n
2,Alabama[edit]\n,Jacksonville (Jacksonville State University)[2]\n
3,Alabama[edit]\n,Livingston (University of West Alabama)[2]\n
4,Alabama[edit]\n,Montevallo (University of Montevallo)[2]\n
5,Alabama[edit]\n,Troy (Troy University)[2]\n
6,Alabama[edit]\n,"Tuscaloosa (University of Alabama, Stillman Co..."
7,Alabama[edit]\n,Tuskegee (Tuskegee University)[5]\n
8,Alaska[edit]\n,Fairbanks (University of Alaska Fairbanks)[2]\n
9,Arizona[edit]\n,Flagstaff (Northern Arizona University)[6]\n
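
The same parsing loop can be tried without the file by feeding it a few lines from an in-memory buffer. In this sketch the text in `sample` is copied from the file preview above, and the names `pairs` and `sample_df` are made up for the example. As a small variation on the notebook's loop, it strips the trailing newline while parsing so that `\n` never reaches the DataFrame.

```python
import io

import pandas as pd

# A few lines in the same layout as university_towns.txt.
sample = io.StringIO(
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "Florence (University of North Alabama)\n"
    "Alaska[edit]\n"
    "Fairbanks (University of Alaska Fairbanks)[2]\n"
)

pairs = []
state = None
for line in sample:
    line = line.rstrip("\n")  # variation: drop the newline up front
    if "[edit]" in line:
        state = line  # remember the state until the next one appears
    else:
        pairs.append((state, line))  # a town belonging to the last-seen state

sample_df = pd.DataFrame(pairs, columns=["State", "RegionName"])
print(sample_df)
```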


# Function to Strip Everything from ( or [ Onward

In [24]:
def get_citystate(item):
    # Keep only the text before ' (' (town rows) or before '[' (state rows)
    if ' (' in item:
        return item[:item.find(' (')]
    elif '[' in item:
        return item[:item.find('[')]
    else:
        return item

Pandas’ .applymap() takes one required parameter: the function (callable) to apply to each element, that is, to each cell value. In the cell below, applymap() takes each element from the DataFrame, passes it to get_citystate, and replaces the original value with the returned value. It’s that simple!



In [None]:
towns_df = towns_df.applymap(get_citystate)

In [104]:
towns_df.head(25)

Unnamed: 0,State,RegionName
0,Alabama,Auburn
1,Alabama,Florence
2,Alabama,Jacksonville
3,Alabama,Livingston
4,Alabama,Montevallo
5,Alabama,Troy
6,Alabama,Tuscaloosa
7,Alabama,Tuskegee
8,Alaska,Fairbanks
9,Arizona,Flagstaff
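
An element-wise cleanup like this can be reproduced on a two-row frame. In the sketch below, `strip_annotations` is a hypothetical helper that mirrors get_citystate, and the `towns` data is typed in by hand. Because applymap() is deprecated from pandas 2.1 onward in favor of DataFrame.map(), the example picks whichever method the installed version provides.

```python
import pandas as pd


def strip_annotations(item):
    """Hypothetical helper: keep the text before ' (' or '[', like get_citystate."""
    if " (" in item:
        return item[: item.find(" (")]
    if "[" in item:
        return item[: item.find("[")]
    return item


towns = pd.DataFrame(
    {
        "State": ["Alabama[edit]\n", "Alaska[edit]\n"],
        "RegionName": [
            "Auburn (Auburn University)[1]\n",
            "Fairbanks (University of Alaska Fairbanks)[2]\n",
        ],
    }
)

# DataFrame.map exists in pandas 2.1+; older versions only have applymap.
if hasattr(pd.DataFrame, "map"):
    cleaned = towns.map(strip_annotations)
else:
    cleaned = towns.applymap(strip_annotations)

print(cleaned)
```

Since slicing stops at the first `(` or `[`, the trailing newline is removed along with the annotations.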


In [15]:
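# Slicing a list returns list elements, not characters, so the single string comes back unchanged
strname = ["Alaska[edit]"]
filtered_str = strname[:6]
filtered_str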

['Alaska[edit]']

# Renaming columns and skipping rows

In [107]:
olympics_df = pd.read_csv('/resources/data/olympics.csv')
olympics_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,Unnamed: 16
0,,? Summer,01 !,02 !,03 !,Total,? Winter,01 !,02 !,03 !,Total,? Games,01 !,02 !,03 !,Combined total,
1,Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2,
2,Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15,
3,Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70,
4,Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12,


Observations:

The columns are the string form of integers indexed at 0.

The row which should have been our header (i.e. the one to be used to set the column names) is at olympics_df.iloc[0]. This happened because our CSV file starts with 0, 1, 2, …, 15.

Looking at that row, NaN should really be something like “Country”, ? Summer is supposed to represent “Summer Games”, 01 ! should be “Gold”, and so on.

Therefore, we need to do two things:

1. Skip one row and set the header as the first (0-indexed) row

2. Rename the columns
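
The effect of the header parameter can be checked on a small in-memory CSV shaped like the Olympics file. This is a sketch: `csv_text` is a made-up four-column excerpt whose first line is the row of integers and whose second line holds the real column names.

```python
import io

import pandas as pd

# Hypothetical excerpt: line 0 is the unwanted row of integers,
# line 1 is the real header, and the data starts on line 2.
csv_text = (
    "0,1,2,3\n"
    ",? Summer,01 !,01 !\n"
    "Afghanistan (AFG),13,0,0\n"
    "Algeria (ALG),12,5,0\n"
)

# Default behaviour: the first line becomes the header.
default_read = pd.read_csv(io.StringIO(csv_text))
print(list(default_read.columns))

# header=1: use line 1 (0-indexed) as the header and discard everything before it.
header_read = pd.read_csv(io.StringIO(csv_text), header=1)
print(list(header_read.columns))
```

Two details are worth noticing in the second result: the empty header cell is named `Unnamed: 0`, and the repeated `01 !` label gets a `.1` suffix so that column names stay unique.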




In [110]:
# We can skip rows and set the header while reading the CSV file by passing some parameters to the
# read_csv() function. This function takes a lot of optional parameters, but in this case we only
# need one (header) to remove the 0th row:

olympics_df = pd.read_csv('/resources/data/olympics.csv', header=1)
# Equivalent alternative: skip the first line, then treat the next line as the header
# olympics_df = pd.read_csv('/resources/data/olympics.csv', skiprows=1, header=0)
olympics_df.head()

Unnamed: 0.1,Unnamed: 0,? Summer,01 !,02 !,03 !,Total,? Winter,01 !.1,02 !.1,03 !.1,Total.1,? Games,01 !.2,02 !.2,03 !.2,Combined total,Unnamed: 16
0,Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2,
1,Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15,
2,Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70,
3,Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12,
4,Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12,


In [111]:
# To rename the columns, we will make use of a DataFrame’s rename() method, which allows you to relabel
# an axis based on a mapping (in this case, a dict). Setting inplace to True specifies that our changes
# be made directly to the object. Let’s see if this checks out:

new_names =  {'Unnamed: 0': 'Country',
              '? Summer': 'Summer Olympics',
              '01 !': 'Gold',
              '02 !': 'Silver',
              '03 !': 'Bronze',
              '? Winter': 'Winter Olympics',
              '01 !.1': 'Gold.1',
              '02 !.1': 'Silver.1',
              '03 !.1': 'Bronze.1',
              '? Games': '# Games', 
              '01 !.2': 'Gold.2',
              '02 !.2': 'Silver.2',
              '03 !.2': 'Bronze.2'}

olympics_df.rename(columns = new_names, inplace = True)

In [112]:
olympics_df.head()

Unnamed: 0,Country,Summer Olympics,Gold,Silver,Bronze,Total,Winter Olympics,Gold.1,Silver.1,Bronze.1,Total.1,# Games,Gold.2,Silver.2,Bronze.2,Combined total,Unnamed: 16
0,Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2,
1,Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15,
2,Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70,
3,Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12,
4,Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12,
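
For comparison, here is a sketch of the same renaming done without inplace, on a made-up two-row `medals` frame. The extra key `'not a column'` is included on purpose to show that, by default, rename() silently ignores mapping keys that match no column.

```python
import pandas as pd

# Hypothetical miniature of the medal table, using the raw column labels.
medals = pd.DataFrame(
    [["Afghanistan (AFG)", 0, 0], ["Algeria (ALG)", 5, 0]],
    columns=["Unnamed: 0", "01 !", "01 !.1"],
)

mapping = {
    "Unnamed: 0": "Country",
    "01 !": "Gold",
    "01 !.1": "Gold.1",
    "not a column": "ignored",  # no such column: skipped without an error
}

# Without inplace=True, rename() returns a new DataFrame.
renamed = medals.rename(columns=mapping)

print(list(renamed.columns))
print(list(medals.columns))  # the original labels are unchanged
```

Returning a new object instead of modifying in place makes it easier to chain further steps and to keep the raw frame around for comparison.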
