![cover](cover/04.%20CSV%20Files.png)

#### Outline
* Reading a CSV File
* Editing a CSV File
* Properties
* Parameters and Functions

##### Parameters and Functions
* nrows
* usecols
* skiprows
* index_col
* header
* names
* head()
* tail()
* dtype

##### Import Pandas

In [1]:
import pandas

### Reading a CSV File
`pandas.read_csv(file_location)`

In [2]:
students = pandas.read_csv("./datasets/students.csv")
print(students)

   first_name   surname sex date_of_birth
0        Emma   Johnson   F    2001-03-15
1        Liam     Brown   M    2003-07-22
2      Olivia    Garcia   F    2002-11-05
3        Noah    Miller   M    2004-09-30
4         Ava     Davis   F    2000-12-11
5       Ethan  Martinez   M    2002-04-19
6      Sophia    Wilson   F    2005-01-28
7       Mason  Anderson   M    2003-06-10
8    Isabella    Thomas   F    2001-08-23
9       Logan    Taylor   M    2004-05-07
10        Mia     Moore   F    2000-02-17
11      Lucas   Jackson   M    2005-10-02
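
To try `read_csv` without the dataset files, the call can be pointed at an in-memory string instead of a path. This is only a sketch: the two-row `csv_text` below is a made-up stand-in for `students.csv`, and `sample` is a name chosen for illustration.

```python
import io
import pandas

# Made-up sample standing in for ./datasets/students.csv
csv_text = """first_name,surname,sex,date_of_birth
Emma,Johnson,F,2001-03-15
Liam,Brown,M,2003-07-22
"""

# read_csv accepts a file path or any file-like object, so StringIO works
sample = pandas.read_csv(io.StringIO(csv_text))
print(sample)
```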


### Properties
* `type()`
* `columns`
* `info()`
* `shape`
* `size`

In [3]:
type(students)

pandas.core.frame.DataFrame

In [4]:
students.columns

Index(['first_name', 'surname', 'sex', 'date_of_birth'], dtype='object')

In [5]:
students.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   first_name     12 non-null     object
 1   surname        12 non-null     object
 2   sex            12 non-null     object
 3   date_of_birth  12 non-null     object
dtypes: object(4)
memory usage: 516.0+ bytes


In [6]:
students.shape

(12, 4)

In [7]:
students.size

48
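
The properties above are related: `shape` is a `(rows, columns)` tuple and `size` is the product of its two entries (12 × 4 = 48 for `students`). A small sketch on made-up data, with `sample` as an illustrative name:

```python
import io
import pandas

# Made-up three-row, three-column sample
csv_text = """first_name,surname,sex
Emma,Johnson,F
Liam,Brown,M
Olivia,Garcia,F
"""
sample = pandas.read_csv(io.StringIO(csv_text))

n_rows, n_cols = sample.shape           # shape unpacks into rows and columns
print(n_rows, n_cols)                   # 3 3
print(sample.size == n_rows * n_cols)   # True
print(len(sample))                      # 3, the number of rows
```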

### Editing a CSV File

### Parameters

#### nrows
`nrows=n`

Sets how many data rows to read from the start of the file. The header line does not count towards `n`.

In [8]:
books = pandas.read_csv("./datasets/goodreads.csv")
print(books)

    rank                                  name             author  avg_rating  \
0      1                 To Kill a Mockingbird         Harper Lee        4.26   
1      2                 The Lord of the Rings     J.R.R. Tolkien        4.54   
2      3            The Wonderful Wizard of Oz      L. Frank Baum        4.00   
3      4            The Fellowship of the Ring     J.R.R. Tolkien        4.40   
4      5                        The Two Towers     J.R.R. Tolkien        4.50   
5      6       One Flew Over the Cuckoo’s Nest          Ken Kesey        4.20   
6      7  Harry Potter and the Deathly Hallows       J.K. Rowling        4.62   
7      8                The Return of the King     J.R.R. Tolkien        4.58   
8      9                    Gone with the Wind  Margaret Mitchell        4.31   
9     10                           The Shining       Stephen King        4.28   
10    11                        V for Vendetta         Alan Moore        4.26   
11    12                    

In [9]:
books_5 = pandas.read_csv("./datasets/goodreads.csv", nrows=5)
print(books_5)

   rank                        name          author  avg_rating  no_ratings
0     1       To Kill a Mockingbird      Harper Lee        4.26     6770621
1     2       The Lord of the Rings  J.R.R. Tolkien        4.54      723003
2     3  The Wonderful Wizard of Oz   L. Frank Baum        4.00      504274
3     4  The Fellowship of the Ring  J.R.R. Tolkien        4.40     3094267
4     5              The Two Towers  J.R.R. Tolkien        4.50     1102518
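
As a quick check that the header line is not counted by `nrows`, here is a sketch on a made-up four-row CSV (`csv_text` and `first_two` are illustrative names):

```python
import io
import pandas

# Made-up sample: one header line followed by four data rows
csv_text = "rank,name\n1,A\n2,B\n3,C\n4,D\n"

# nrows=2 keeps the header plus the first two data rows
first_two = pandas.read_csv(io.StringIO(csv_text), nrows=2)
print(first_two)
```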


#### usecols
`usecols=[a, b, ..., n]`

Defines which columns of the dataset to load. Columns can be given by name (strings) or by zero-based position (integers).

In [10]:
books_3 = pandas.read_csv("./datasets/goodreads.csv", nrows=3, usecols=["name", "avg_rating"])
print(books_3)

                         name  avg_rating
0       To Kill a Mockingbird        4.26
1       The Lord of the Rings        4.54
2  The Wonderful Wizard of Oz        4.00


In [11]:
books_3 = pandas.read_csv("./datasets/goodreads.csv", nrows=3, usecols=[1,3])
print(books_3)

                         name  avg_rating
0       To Kill a Mockingbird        4.26
1       The Lord of the Rings        4.54
2  The Wonderful Wizard of Oz        4.00
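
One detail worth knowing is that `usecols` only selects columns; the order of the list is ignored and the result keeps the column order of the file. A sketch on made-up data (all names below are illustrative):

```python
import io
import pandas

# Made-up sample with the same column layout as the first four goodreads columns
csv_text = "rank,name,author,avg_rating\n1,Book A,Author X,4.26\n2,Book B,Author Y,4.54\n"

by_name = pandas.read_csv(io.StringIO(csv_text), usecols=["name", "avg_rating"])
by_position = pandas.read_csv(io.StringIO(csv_text), usecols=[1, 3])
reordered = pandas.read_csv(io.StringIO(csv_text), usecols=["avg_rating", "name"])

print(by_name.equals(by_position))   # True: names and positions select the same data
print(list(reordered.columns))       # ['name', 'avg_rating'], file order is kept
```

To get the columns in a different order, reorder them after loading, for example with `reordered[["avg_rating", "name"]]`.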


#### skiprows
`skiprows=[a, b, ..., n]`

Defines which lines of the file to skip. Lines are identified by their zero-based line number in the file, so line 0 is the header line. Passing a single integer `n` instead of a list skips the first `n` lines.

In [12]:
films = pandas.read_csv("./datasets/imdb.csv")
print(films)

    rank                                               name  \
0      1                           The Shawshank Redemption   
1      2                                      The Godfather   
2      3                                    The Dark Knight   
3      4                              The Godfather Part II   
4      5                                       12 Angry Men   
5      6      The Lord of the Rings: The Return of the King   
6      7                                   Schindler's List   
7      8  The Lord of the Rings: The Fellowship of the Ring   
8      9                                       Pulp Fiction   
9     10                      The Good the Bad and the Ugly   
10    11                                       Forrest Gump   
11    12              The Lord of the Rings: The Two Towers   

                director  length  avg_rating  no_ratings  
0         Frank Darabont  2h 22m         9.3     3100000  
1   Francis Ford Coppola  2h 55m         9.2     2200000  
2  

In [13]:
short_name_films = pandas.read_csv("./datasets/imdb.csv", skiprows=[1,4,6,8,10,12])
print(short_name_films)

   rank              name              director  length  avg_rating  \
0     2     The Godfather  Francis Ford Coppola  2h 55m         9.2   
1     3   The Dark Knight     Christopher Nolan  2h 32m         9.1   
2     5      12 Angry Men          Sidney Lumet  1h 36m         9.0   
3     7  Schindler's List      Steven Spielberg  3h 15m         9.0   
4     9      Pulp Fiction     Quentin Tarantino  2h 34m         8.8   
5    11      Forrest Gump       Robert Zemeckis  2h 22m         8.8   

   no_ratings  
0     2200000  
1     3100000  
2      953000  
3     1600000  
4     2400000  
5     2400000  
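
The list form and the integer form of `skiprows` behave quite differently, which the following sketch tries to show on made-up data (`picked` and `shifted` are illustrative names):

```python
import io
import pandas

# Made-up sample; file line 0 is the header, lines 1-4 are data
csv_text = "rank,name\n1,A\n2,B\n3,C\n4,D\n"

# A list skips exactly those file lines: here the rows with rank 1 and rank 3
picked = pandas.read_csv(io.StringIO(csv_text), skiprows=[1, 3])
print(picked["rank"].tolist())   # [2, 4]

# An integer skips that many lines from the top, including the header,
# so the first remaining line ("2,B") is read as the header
shifted = pandas.read_csv(io.StringIO(csv_text), skiprows=2)
print(list(shifted.columns))     # ['2', 'B']
```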


#### index_col
`index_col=n`

Uses a column of the dataset as the row index instead of the default 0, 1, 2, ... numbering. The column can be given by position or by name.

In [14]:
movies_ranked = pandas.read_csv("./datasets/imdb.csv", index_col=0)
print(movies_ranked)

                                                   name              director  \
rank                                                                            
1                              The Shawshank Redemption        Frank Darabont   
2                                         The Godfather  Francis Ford Coppola   
3                                       The Dark Knight     Christopher Nolan   
4                                 The Godfather Part II  Francis Ford Coppola   
5                                          12 Angry Men          Sidney Lumet   
6         The Lord of the Rings: The Return of the King         Peter Jackson   
7                                      Schindler's List      Steven Spielberg   
8     The Lord of the Rings: The Fellowship of the Ring         Peter Jackson   
9                                          Pulp Fiction     Quentin Tarantino   
10                        The Good the Bad and the Ugly          Sergio Leone   
11                          
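
Once a column is the index, rows can be looked up by its values with `.loc`. A minimal sketch, assuming a made-up two-film sample in place of `imdb.csv`:

```python
import io
import pandas

# Made-up sample with a rank column to use as the index
csv_text = "rank,name,director\n1,Film A,Director X\n2,Film B,Director Y\n"

by_position = pandas.read_csv(io.StringIO(csv_text), index_col=0)
by_name = pandas.read_csv(io.StringIO(csv_text), index_col="rank")

print(by_position.equals(by_name))   # True: both forms pick the same column
print(by_name.loc[2, "name"])        # Film B, looked up by rank rather than position
```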

#### header
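`header=n` (can also be `None`)

Uses row `n` as the header (column names) and discards any rows above it. `n` is counted after any rows removed by `skiprows`. With `header=None`, no row is treated as a header and the columns are numbered 0, 1, 2, ...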

In [15]:
print(films)

    rank                                               name  \
0      1                           The Shawshank Redemption   
1      2                                      The Godfather   
2      3                                    The Dark Knight   
3      4                              The Godfather Part II   
4      5                                       12 Angry Men   
5      6      The Lord of the Rings: The Return of the King   
6      7                                   Schindler's List   
7      8  The Lord of the Rings: The Fellowship of the Ring   
8      9                                       Pulp Fiction   
9     10                      The Good the Bad and the Ugly   
10    11                                       Forrest Gump   
11    12              The Lord of the Rings: The Two Towers   

                director  length  avg_rating  no_ratings  
0         Frank Darabont  2h 22m         9.3     3100000  
1   Francis Ford Coppola  2h 55m         9.2     2200000  
2  

In [16]:
films1 = pandas.read_csv("./datasets/imdb.csv", header=None, skiprows=[0])
print(films1)

     0                                                  1  \
0    1                           The Shawshank Redemption   
1    2                                      The Godfather   
2    3                                    The Dark Knight   
3    4                              The Godfather Part II   
4    5                                       12 Angry Men   
5    6      The Lord of the Rings: The Return of the King   
6    7                                   Schindler's List   
7    8  The Lord of the Rings: The Fellowship of the Ring   
8    9                                       Pulp Fiction   
9   10                      The Good the Bad and the Ugly   
10  11                                       Forrest Gump   
11  12              The Lord of the Rings: The Two Towers   

                       2       3    4        5  
0         Frank Darabont  2h 22m  9.3  3100000  
1   Francis Ford Coppola  2h 55m  9.2  2200000  
2      Christopher Nolan  2h 32m  9.1  3100000  
3   Franci

In [17]:
films2 = pandas.read_csv("./datasets/imdb.csv", header=4, skiprows=1)
print(films2)

    5                                       12 Angry Men       Sidney Lumet  \
0   6      The Lord of the Rings: The Return of the King      Peter Jackson   
1   7                                   Schindler's List   Steven Spielberg   
2   8  The Lord of the Rings: The Fellowship of the Ring      Peter Jackson   
3   9                                       Pulp Fiction  Quentin Tarantino   
4  10                      The Good the Bad and the Ugly       Sergio Leone   
5  11                                       Forrest Gump    Robert Zemeckis   
6  12              The Lord of the Rings: The Two Towers      Peter Jackson   

   1h 36m  9.0   953000  
0  3h 21m  9.0  2100000  
1  3h 15m  9.0  1600000  
2  2h 58m  8.9  2100000  
3  2h 34m  8.8  2400000  
4  2h 28m  8.8   871000  
5  2h 22m  8.8  2400000  
6  2h 59m  8.8  1900000  
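
The two uses of `header` can be compared side by side on a small made-up file (`no_header` and `promoted` are illustrative names):

```python
import io
import pandas

# Made-up sample: header on line 0, three data rows
csv_text = "rank,name\n1,A\n2,B\n3,C\n"

# header=None: nothing is treated as a header, so the original
# header line becomes the first data row and columns are numbered
no_header = pandas.read_csv(io.StringIO(csv_text), header=None)
print(list(no_header.columns))   # [0, 1]
print(no_header.shape)           # (4, 2)

# header=1: line 1 ("1,A") supplies the column names and line 0 is discarded
promoted = pandas.read_csv(io.StringIO(csv_text), header=1)
print(list(promoted.columns))    # ['1', 'A']
```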


#### names
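`names=["", ..., ""]`

Sets the column names for the dataset. If the file already has a header line, skip it (as with `skiprows=1` below) so that it is not read as a data row.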

In [18]:
menu = pandas.read_csv("./datasets/menu_items.csv")
print(menu)

                     name  cost_to_produce  cost_to_customer   category
0            Cheeseburger             1.25              4.99       food
1                   Fries             0.60              2.49       food
2   Chicken Nuggets (6pc)             1.10              3.99       food
3     Soft Drink (Medium)             0.25              1.89      drink
4               Milkshake             0.90              3.49      drink
5        Ice Cream Sundae             0.80              2.99    dessert
6          Ketchup Packet             0.02              0.10  condiment
7           BBQ Sauce Cup             0.05              0.25  condiment
8               Apple Pie             0.70              2.59    dessert
9             Veggie Wrap             1.50              5.49       food
10          Bottled Water             0.20              1.49      drink
11       Chicken Sandwich             1.40              4.79       food


In [19]:
short_menu = pandas.read_csv("./datasets/menu_items.csv", names=["name", "cost", "sell_price", "category"], skiprows=1)
print(short_menu)

                     name  cost  sell_price   category
0            Cheeseburger  1.25        4.99       food
1                   Fries  0.60        2.49       food
2   Chicken Nuggets (6pc)  1.10        3.99       food
3     Soft Drink (Medium)  0.25        1.89      drink
4               Milkshake  0.90        3.49      drink
5        Ice Cream Sundae  0.80        2.99    dessert
6          Ketchup Packet  0.02        0.10  condiment
7           BBQ Sauce Cup  0.05        0.25  condiment
8               Apple Pie  0.70        2.59    dessert
9             Veggie Wrap  1.50        5.49       food
10          Bottled Water  0.20        1.49      drink
11       Chicken Sandwich  1.40        4.79       food
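
An alternative to `skiprows=1` is to pass `header=0` together with `names`, which replaces the existing header. The sketch below, on made-up data with illustrative names, also shows what happens when the old header is not skipped:

```python
import io
import pandas

# Made-up sample with an existing header line
csv_text = "name,cost_to_produce\nFries,0.60\nMilkshake,0.90\n"

# header=0 marks line 0 as the header, and names replaces it
replaced = pandas.read_csv(io.StringIO(csv_text), names=["item", "cost"], header=0)
print(len(replaced))          # 2

# Without header=0 or skiprows=1 the old header is kept as a data row
kept = pandas.read_csv(io.StringIO(csv_text), names=["item", "cost"])
print(len(kept))              # 3
print(kept.iloc[0, 0])        # name
```

In the second case the `cost` column also stops being numeric, because its first value is the text `cost_to_produce`.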


#### dtype
`dtype={"" : ""}`

Sets the data type of one or more columns as they are read, given as a mapping from column name to type.

In [20]:
employees = pandas.read_csv("./datasets/employees.csv")
print(employees)

   first_name   surname sex                       email  \
0        John     Smith   M      john.smith@company.com   
1       Emily     Jones   F     emily.jones@company.com   
2     Michael     Brown   M   michael.brown@company.com   
3       Sarah    Wilson   F    sarah.wilson@company.com   
4       David   Johnson   M   david.johnson@company.com   
5        Anna    Garcia   F     anna.garcia@company.com   
6       James    Miller   M    james.miller@company.com   
7       Laura  Martinez   F  laura.martinez@company.com   
8      Robert     Davis   M    robert.davis@company.com   
9      Sophia     Lopez   F    sophia.lopez@company.com   
10     Daniel    Taylor   M   daniel.taylor@company.com   
11     Olivia     Clark   F    olivia.clark@company.com   

                     role  salary  
0                 Manager   62000  
1            Data Analyst   52000  
2       Software Engineer   58000  
3           HR Specialist   47000  
4         Sales Executive   54000  
5   Marketing Co

In [21]:
employees = pandas.read_csv("./datasets/employees.csv", dtype={"salary":"float"})
print(employees)

   first_name   surname sex                       email  \
0        John     Smith   M      john.smith@company.com   
1       Emily     Jones   F     emily.jones@company.com   
2     Michael     Brown   M   michael.brown@company.com   
3       Sarah    Wilson   F    sarah.wilson@company.com   
4       David   Johnson   M   david.johnson@company.com   
5        Anna    Garcia   F     anna.garcia@company.com   
6       James    Miller   M    james.miller@company.com   
7       Laura  Martinez   F  laura.martinez@company.com   
8      Robert     Davis   M    robert.davis@company.com   
9      Sophia     Lopez   F    sophia.lopez@company.com   
10     Daniel    Taylor   M   daniel.taylor@company.com   
11     Olivia     Clark   F    olivia.clark@company.com   

                     role   salary  
0                 Manager  62000.0  
1            Data Analyst  52000.0  
2       Software Engineer  58000.0  
3           HR Specialist  47000.0  
4         Sales Executive  54000.0  
5   Market
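
Several columns can be typed in one call, and `dtype` is also useful when the default guess would lose information. The sketch below assumes a hypothetical `staff_id` column (not part of `employees.csv`) whose leading zeros should be kept:

```python
import io
import pandas

# Made-up sample; staff_id is a hypothetical column with leading zeros
csv_text = "staff_id,surname,salary\n007,Smith,62000\n012,Jones,52000\n"

# By default staff_id is inferred as an integer and the zeros are lost
default = pandas.read_csv(io.StringIO(csv_text))
print(default["staff_id"].tolist())   # [7, 12]

# Reading staff_id as text keeps it intact; salary becomes a float
typed = pandas.read_csv(io.StringIO(csv_text), dtype={"staff_id": str, "salary": "float"})
print(typed["staff_id"].tolist())     # ['007', '012']
print(typed["salary"].tolist())       # [62000.0, 52000.0]
```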

### Functions

#### head()
`head(n)`

Returns the first `n` rows of the dataset (default `n=5`).

In [22]:
print(employees)

   first_name   surname sex                       email  \
0        John     Smith   M      john.smith@company.com   
1       Emily     Jones   F     emily.jones@company.com   
2     Michael     Brown   M   michael.brown@company.com   
3       Sarah    Wilson   F    sarah.wilson@company.com   
4       David   Johnson   M   david.johnson@company.com   
5        Anna    Garcia   F     anna.garcia@company.com   
6       James    Miller   M    james.miller@company.com   
7       Laura  Martinez   F  laura.martinez@company.com   
8      Robert     Davis   M    robert.davis@company.com   
9      Sophia     Lopez   F    sophia.lopez@company.com   
10     Daniel    Taylor   M   daniel.taylor@company.com   
11     Olivia     Clark   F    olivia.clark@company.com   

                     role   salary  
0                 Manager  62000.0  
1            Data Analyst  52000.0  
2       Software Engineer  58000.0  
3           HR Specialist  47000.0  
4         Sales Executive  54000.0  
5   Market

In [23]:
print(employees.head())

  first_name  surname sex                      email               role  \
0       John    Smith   M     john.smith@company.com            Manager   
1      Emily    Jones   F    emily.jones@company.com       Data Analyst   
2    Michael    Brown   M  michael.brown@company.com  Software Engineer   
3      Sarah   Wilson   F   sarah.wilson@company.com      HR Specialist   
4      David  Johnson   M  david.johnson@company.com    Sales Executive   

    salary  
0  62000.0  
1  52000.0  
2  58000.0  
3  47000.0  
4  54000.0  


In [24]:
print(employees.head(4))

  first_name surname sex                      email               role  \
0       John   Smith   M     john.smith@company.com            Manager   
1      Emily   Jones   F    emily.jones@company.com       Data Analyst   
2    Michael   Brown   M  michael.brown@company.com  Software Engineer   
3      Sarah  Wilson   F   sarah.wilson@company.com      HR Specialist   

    salary  
0  62000.0  
1  52000.0  
2  58000.0  
3  47000.0  


In [25]:
print(employees.head(1))

  first_name surname sex                   email     role   salary
0       John   Smith   M  john.smith@company.com  Manager  62000.0


#### tail()
`tail(n)`

Returns the last `n` rows of the dataset (default `n=5`).

In [26]:
print(students)

   first_name   surname sex date_of_birth
0        Emma   Johnson   F    2001-03-15
1        Liam     Brown   M    2003-07-22
2      Olivia    Garcia   F    2002-11-05
3        Noah    Miller   M    2004-09-30
4         Ava     Davis   F    2000-12-11
5       Ethan  Martinez   M    2002-04-19
6      Sophia    Wilson   F    2005-01-28
7       Mason  Anderson   M    2003-06-10
8    Isabella    Thomas   F    2001-08-23
9       Logan    Taylor   M    2004-05-07
10        Mia     Moore   F    2000-02-17
11      Lucas   Jackson   M    2005-10-02


In [27]:
print(students.tail())

   first_name   surname sex date_of_birth
7       Mason  Anderson   M    2003-06-10
8    Isabella    Thomas   F    2001-08-23
9       Logan    Taylor   M    2004-05-07
10        Mia     Moore   F    2000-02-17
11      Lucas   Jackson   M    2005-10-02


In [28]:
print(students.tail(7))

   first_name   surname sex date_of_birth
5       Ethan  Martinez   M    2002-04-19
6      Sophia    Wilson   F    2005-01-28
7       Mason  Anderson   M    2003-06-10
8    Isabella    Thomas   F    2001-08-23
9       Logan    Taylor   M    2004-05-07
10        Mia     Moore   F    2000-02-17
11      Lucas   Jackson   M    2005-10-02


In [29]:
print(short_menu.tail(3))

                name  cost  sell_price category
9        Veggie Wrap   1.5        5.49     food
10     Bottled Water   0.2        1.49    drink
11  Chicken Sandwich   1.4        4.79     food
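
A few edge cases of `head()` and `tail()` are sketched below on a made-up four-row sample (`letters` is an illustrative name): rows keep their original index labels, a negative `n` excludes rows from the other end, and an `n` larger than the dataset simply returns everything.

```python
import io
import pandas

# Made-up sample with four data rows
csv_text = "name\nA\nB\nC\nD\n"
letters = pandas.read_csv(io.StringIO(csv_text))

print(letters.head(2)["name"].tolist())    # ['A', 'B']
print(letters.tail(2).index.tolist())      # [2, 3], original index labels are kept
print(letters.head(-1)["name"].tolist())   # ['A', 'B', 'C'], all but the last row
print(len(letters.head(10)))               # 4, no error when n exceeds the row count
```

Unlike `nrows`, which limits what is read from the file, `head()` and `tail()` work on a dataset that has already been loaded in full.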


### For Source code:
https://sites.google.com/view/aorbtech/programming/

#### @Aorb Tech