# What is data parsing?
If you google definitions, you will find a bit of everything: syntactic analysis, changing file formats, making files readable by the computer, and so on. None of these definitions is incorrect. Put in simple terms, **parsing** is reading and processing files; to do so, you need to convert the contents of the files to a format that is readable (both for you and the computer) and modifiable. You can parse a file directly by looping through its lines, but you can also use Python modules that help with the parsing, such as **pandas** and **re**. Let's start with the most basic (and yet most laborious) scenario, where we parse ``parse.csv`` directly by looping through the file.

## Looping through lines in a file
``parse.csv`` contains some information about a series of superheroes and villains: their name, alias, class (whether they are heroes or villains) and the city and country where they are active. Let's take a look at it:

In [76]:
with open("parse.csv") as file: # Open the file
    for line in file: # Loop through lines in the file
        print(line) # Print lines

first_name,last_name,alias,class,city,country

Lax,Salmonsson,The Salmon,hero,Uppsala,Sweden

Ivo,Yolk,Dr. Eggsalt,villain,Tomatown,Easterland

Luis,Gibert,Orceman,hero,Orce,Spain

Walter,Bruce,Batswede,hero,Gothamburg,Sweden

Jac,Skämt,Skojaren,villain,Gothamburg,Sweden

Juanito,Naipes,El Bromas,villain,Grandía,Spain

Banan,D'Artagnan,Musketeer,hero,Morecastle,France



In this case, we used ``open()`` inside a ``with`` statement, which runs all the indented lines of code and then **automatically closes the file**. The ``open()`` function can also be used on its own, with a simpler syntax that doesn't require you to indent the code below, but that also **doesn't automatically close the file**:

In [77]:
file = open("parse.csv") # Open the file
for line in file: # Loop through lines in the file
    print(line) # Print lines
file.close() # Close file

first_name,last_name,alias,class,city,country

Lax,Salmonsson,The Salmon,hero,Uppsala,Sweden

Ivo,Yolk,Dr. Eggsalt,villain,Tomatown,Easterland

Luis,Gibert,Orceman,hero,Orce,Spain

Walter,Bruce,Batswede,hero,Gothamburg,Sweden

Jac,Skämt,Skojaren,villain,Gothamburg,Sweden

Juanito,Naipes,El Bromas,villain,Grandía,Spain

Banan,D'Artagnan,Musketeer,hero,Morecastle,France



As you can see, in this case you need to call the ``close()`` method after you're done processing your file. For this reason, I recommend that you **always use ``with open(<path to file>) as <variable name>``**. Remember that, if your file is not in the same folder where you run your notebook or script, you won't manage to open it unless you provide the **path** to the file from the directory where you run your script.
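If you want to check this for yourself, file objects have a ``closed`` attribute that tells you whether they are still open. Here is a minimal sketch, assuming a throwaway file called ``demo.csv`` (a made-up name) that is created in a temporary folder so the example runs anywhere:

```python
import os
import tempfile

# Create a small throwaway file (hypothetical name) so the example runs anywhere
folder = tempfile.mkdtemp()
path = os.path.join(folder, "demo.csv")  # Path to the file, built from folder + file name
with open(path, "w") as file:
    file.write("first_name,last_name\nLax,Salmonsson\n")

# With "with", the file is closed as soon as the indented block ends
with open(path) as file:
    first_line = file.readline()
closed_after_with = file.closed
print(closed_after_with)  # True

# Without "with", the file stays open until we call close() ourselves
file = open(path)
closed_before_close = file.closed
print(closed_before_close)  # False
file.close()
closed_after_close = file.closed
print(closed_after_close)  # True
```

Note that ``path`` here is the full path to the file, which is what you need to pass to ``open()`` whenever the file is not in the folder you are running from.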

## Text processing
Now we have looked into how to open a file, but perhaps this is not the most readable format. For example, we have a **header** line with the names of the variables; perhaps we want to skip that. Or perhaps we want to embrace chaos, that's up to you. However, I will assume that you want to improve readability. How do we do that? The first thing that we can do is to **split** each line into fields, using the comma as a separator, since it is a comma-separated values file (``.csv``):

In [78]:
with open("parse.csv") as file:
    for line in file:
        line = line.split(",") # New line
        print(line)

['first_name', 'last_name', 'alias', 'class', 'city', 'country\n']
['Lax', 'Salmonsson', 'The Salmon', 'hero', 'Uppsala', 'Sweden\n']
['Ivo', 'Yolk', 'Dr. Eggsalt', 'villain', 'Tomatown', 'Easterland\n']
['Luis', 'Gibert', 'Orceman', 'hero', 'Orce', 'Spain\n']
['Walter', 'Bruce', 'Batswede', 'hero', 'Gothamburg', 'Sweden\n']
['Jac', 'Skämt', 'Skojaren', 'villain', 'Gothamburg', 'Sweden\n']
['Juanito', 'Naipes', 'El Bromas', 'villain', 'Grandía', 'Spain\n']
['Banan', "D'Artagnan", 'Musketeer', 'hero', 'Morecastle', 'France\n']


Now we have it as a list! We have used the ``split()`` method and asked it to split the text at every comma. This is a bit nicer, but there are still some issues. For example, we have a special character, the linebreak ``\n``, at the end of the last element in the list. We can remove whitespace (spaces, tabs and linebreaks) at the start and end of a string with the method ``strip()``. Note that this method works on strings, not on lists. Therefore, we should use ``strip()`` before we use ``split()``.
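Before applying this to the file, here is a small sketch of how ``strip()`` behaves on a single string. The example line is made up for illustration (I added extra spaces on purpose):

```python
line = "  Lax,Salmonsson,The Salmon \n"  # Made-up line with extra spaces and a linebreak

stripped = line.strip()            # Removes whitespace at both ends, not in the middle
only_newline = line.rstrip("\n")   # Removes only the linebreak at the end
print(repr(stripped))              # 'Lax,Salmonsson,The Salmon'
print(repr(only_newline))          # '  Lax,Salmonsson,The Salmon '

# If you split first, you get a list, and lists have no strip() method.
# You would then need to clean every element separately:
fields = line.split(",")
cleaned = [field.strip() for field in fields]
print(cleaned)                     # ['Lax', 'Salmonsson', 'The Salmon']
```

Notice that the space inside ``The Salmon`` survives: ``strip()`` only touches the two ends of the string.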

In [86]:
with open("parse.csv") as file:
    for line in file:
        line = line.strip() # New line
        line = line.split(",")
        print(line)

['first_name', 'last_name', 'alias', 'class', 'city', 'country']
['Lax', 'Salmonsson', 'The Salmon', 'hero', 'Uppsala', 'Sweden']
['Ivo', 'Yolk', 'Dr. Eggsalt', 'villain', 'Tomatown', 'Easterland']
['Luis', 'Gibert', 'Orceman', 'hero', 'Orce', 'Spain']
['Walter', 'Bruce', 'Batswede', 'hero', 'Gothamburg', 'Sweden']
['Jac', 'Skämt', 'Skojaren', 'villain', 'Gothamburg', 'Sweden']
['Juanito', 'Naipes', 'El Bromas', 'villain', 'Grandía', 'Spain']
['Banan', "D'Artagnan", 'Musketeer', 'hero', 'Morecastle', 'France']


We got rid of the linebreaks! But this is still not the nicest way to look into our data, even though we can already do simple transformations, like printing a combination of the variables:

In [93]:
with open("parse.csv") as file:
    next(file) # Skip header
    for line in file:
        line = line.strip()
        line = line.split(",")
        print_text = f"{line[0]} {line[1]}, known as {line[2]}, {line[3]} of {line[4]} ({line[5]})" # String to print the info nicely
        print(print_text) # Modified line

Lax Salmonsson, known as The Salmon, hero of Uppsala (Sweden)
Ivo Yolk, known as Dr. Eggsalt, villain of Tomatown (Easterland)
Luis Gibert, known as Orceman, hero of Orce (Spain)
Walter Bruce, known as Batswede, hero of Gothamburg (Sweden)
Jac Skämt, known as Skojaren, villain of Gothamburg (Sweden)
Juanito Naipes, known as El Bromas, villain of Grandía (Spain)
Banan D'Artagnan, known as Musketeer, hero of Morecastle (France)


This looks better. Actually, I could be looking at this forever. In fact, maybe I should, but I need to save it somewhere so that I don't forget. Maybe it's about time you learn **how to write files**.

In [102]:
with open("parse.csv") as file:
    next(file)
    for line in file:
        line = line.strip()
        line = line.split(",")
        print_text = f"{line[0]} {line[1]}, known as {line[2]}, {line[3]} of {line[4]} ({line[5]})"
        with open("my_output.txt", "a+") as output: # Create and append to new file
            output.write(print_text + "\n") # Write the print_text variable to new file

Now I have a file I can look at forever! Unless it gets lost in the mess of folders that my home directory is, yay! As you might have noticed, I passed an additional argument, ``"a+"``, when calling the ``open()`` function. This argument sets the mode in which the file is opened. There are four basic modes:

1. ``r``: read mode. This is the default. You can't modify the file that you open in this mode.

2. ``w``: write mode. This allows you to create a file if it doesn't exist and add content to it. If it exists, it overwrites the file.

3. ``a``: append mode. This adds content at the end of the file, and creates the file if it doesn't exist.

4. ``x``: create mode. This creates the file if it doesn't exist; otherwise it raises an error.

There are, however, other modes that you can use. For example, the ``a+`` mode works like ``a`` (it creates the file if it doesn't exist and adds the new contents at the end), but it also allows you to read the file. Unlike write mode, it doesn't overwrite the file every time it is opened (which would be a problem, because we are opening the file and calling ``write()`` inside a loop!). The ``write()`` method is where we specify the content that we want to **write** to the file.
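The difference between the modes is easier to see by trying them on a throwaway file. The sketch below is only an illustration: ``modes_demo.txt`` is a made-up file name, and it is created in a temporary folder so that nothing on your computer gets overwritten.

```python
import os
import tempfile

# Work in a temporary folder; "modes_demo.txt" is a made-up file name
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

with open(path, "w") as output:   # "w" creates the file
    output.write("first\n")
with open(path, "w") as output:   # "w" again: the old content is replaced
    output.write("second\n")
with open(path, "a") as output:   # "a" keeps the content and adds to the end
    output.write("third\n")

with open(path) as file:          # Default mode, "r"
    content = file.read()
print(content)                    # "second" and "third", but no "first"

x_failed = False
try:
    open(path, "x")               # "x" refuses to touch an existing file
except FileExistsError:
    x_failed = True
    print("The file already exists")
```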

We can keep processing the file in many different ways. In this case, I will use a dictionary comprehension to store a list of first names under the key ``first_name``, a list of last names under the key ``last_name`` and so on. The first step will be to store the header and the values in separate variables:

In [91]:
with open("parse.csv") as file:
    values = [] # Empty list to store values
    for line in file:
        line = line.strip()
        line = line.split(",")
        if "first_name" in line: # If it is the header
            header = line # Store in a new variable
        else:
            values.append(line) # Otherwise store in values

# Print new variables
print(header, end = "\n\n")
print(values)

['first_name', 'last_name', 'alias', 'class', 'city', 'country']

[['Lax', 'Salmonsson', 'The Salmon', 'hero', 'Uppsala', 'Sweden'], ['Ivo', 'Yolk', 'Dr. Eggsalt', 'villain', 'Tomatown', 'Easterland'], ['Luis', 'Gibert', 'Orceman', 'hero', 'Orce', 'Spain'], ['Walter', 'Bruce', 'Batswede', 'hero', 'Gothamburg', 'Sweden'], ['Jac', 'Skämt', 'Skojaren', 'villain', 'Gothamburg', 'Sweden'], ['Juanito', 'Naipes', 'El Bromas', 'villain', 'Grandía', 'Spain'], ['Banan', "D'Artagnan", 'Musketeer', 'hero', 'Morecastle', 'France']]


Now, we want to separate the values by header (first names, last names, etc.):

In [92]:
new_list = [[] for _ in header] # Create one empty list per header

for val in values: # Loop through the rows
    for i in range(len(val)):
        new_list[i].append(val[i]) # Put together values that should have the same header

value_dict = {header[i]: new_list[i] for i in range(len(header))} # Create dictionary with headers as keys
print(value_dict)

{'first_name': ['Lax', 'Ivo', 'Luis', 'Walter', 'Jac', 'Juanito', 'Banan'], 'last_name': ['Salmonsson', 'Yolk', 'Gibert', 'Bruce', 'Skämt', 'Naipes', "D'Artagnan"], 'alias': ['The Salmon', 'Dr. Eggsalt', 'Orceman', 'Batswede', 'Skojaren', 'El Bromas', 'Musketeer'], 'class': ['hero', 'villain', 'hero', 'hero', 'villain', 'villain', 'hero'], 'city': ['Uppsala', 'Tomatown', 'Orce', 'Gothamburg', 'Gothamburg', 'Grandía', 'Morecastle'], 'country': ['Sweden', 'Easterland', 'Spain', 'Sweden', 'Sweden', 'Spain', 'France']}


This was quite complicated, wasn't it? Luckily, there is an easier way to parse ``csv`` files: using **pandas**.
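Before we move on to pandas, it is worth knowing that Python's standard library also has a ``csv`` module that takes care of the splitting for us. The sketch below is one possible way to build the same kind of dictionary with it. To keep it self-contained, I assume a shortened copy of the file stored in a string (``csv_text``, a made-up variable) instead of ``parse.csv`` itself:

```python
import csv
import io

# Stand-in for parse.csv so the sketch runs on its own (first rows only)
csv_text = """first_name,last_name,alias,class,city,country
Lax,Salmonsson,The Salmon,hero,Uppsala,Sweden
Ivo,Yolk,Dr. Eggsalt,villain,Tomatown,Easterland
Luis,Gibert,Orceman,hero,Orce,Spain
"""

reader = csv.reader(io.StringIO(csv_text))  # Splits each row at the commas
header = next(reader)                       # First row holds the column names
rows = list(reader)                         # Remaining rows hold the values

# zip(*rows) regroups the values column by column
columns = {name: list(col) for name, col in zip(header, zip(*rows))}
print(columns["alias"])  # ['The Salmon', 'Dr. Eggsalt', 'Orceman']
```

With a real file, you would pass the file object from ``with open("parse.csv") as file`` to ``csv.reader()`` instead of the ``io.StringIO`` object.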

### pandas
Pandas is a Python library that makes data parsing much easier. Here I won't show you all of its possible uses, just how to create a **dataframe** from our ``csv`` file, ``parse.csv``, and the properties of this data type. The first step, like when we use any other library or module, is to **import** the module. The syntax is quite simple:

In [103]:
import pandas

You can use aliases to import modules; for example, if the module has a really long name and you want to shorten it. In this case, we don't really need an alias, but it is common to import ``pandas`` using the alias ``pd``:

In [104]:
import pandas as pd

Now that we have imported the module, let's move to more important matters: how to use it.

In [105]:
with open("parse.csv") as file:
    my_df = pd.read_csv(file)
    
print(my_df)

  first_name   last_name        alias    class        city     country
0        Lax  Salmonsson   The Salmon     hero     Uppsala      Sweden
1        Ivo        Yolk  Dr. Eggsalt  villain    Tomatown  Easterland
2       Luis      Gibert      Orceman     hero        Orce       Spain
3     Walter       Bruce     Batswede     hero  Gothamburg      Sweden
4        Jac       Skämt     Skojaren  villain  Gothamburg      Sweden
5    Juanito      Naipes    El Bromas  villain     Grandía       Spain
6      Banan  D'Artagnan    Musketeer     hero  Morecastle      France


Wow, so simple and easy compared with the scenario where we create the dictionary! All the values are already sorted into columns, and we didn't need to make any effort!
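As a side note, ``read_csv()`` also accepts the path to the file directly, so ``pd.read_csv("parse.csv")`` would work without ``with open()``. The sketch below assumes a shortened copy of the data stored in a string (``csv_text``, a made-up variable) so that it runs without the file, and it also shows the ``shape`` attribute, which holds the number of rows and columns:

```python
import io
import pandas as pd

# Shortened stand-in for parse.csv (first rows only)
csv_text = """first_name,last_name,alias,class,city,country
Lax,Salmonsson,The Salmon,hero,Uppsala,Sweden
Ivo,Yolk,Dr. Eggsalt,villain,Tomatown,Easterland
"""

# io.StringIO stands in for the file; pd.read_csv("parse.csv") would read it from disk
my_df = pd.read_csv(io.StringIO(csv_text))

print(my_df.shape)    # (number of rows, number of columns)
print(my_df.head(1))  # First row only
```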

Let's look into some properties of this new, mysterious data class that has saved us so much trouble.

In [108]:
my_df["first_name"]

0        Lax
1        Ivo
2       Luis
3     Walter
4        Jac
5    Juanito
6      Banan
Name: first_name, dtype: object

Wow, we can retrieve first names so easily!

In [110]:
my_df.columns

Index(['first_name', 'last_name', 'alias', 'class', 'city', 'country'], dtype='object')

Wow, it has an attribute where the names of the columns are stored! Will it have one for the rows?

In [112]:
my_df.index

RangeIndex(start=0, stop=7, step=1)

Yes, it does! (Even though its name might be a bit less straightforward...). Can we get specific rows, like we did for the columns?

In [117]:
my_df[my_df.index == 3]

  first_name  last_name     alias  class        city  country
3     Walter      Bruce  Batswede   hero  Gothamburg   Sweden


Yes, we can! The logic behind dataframes is similar to that behind **SQL databases**, for those of you who know SQL. The idea is that it should be easy to retrieve the pieces of information that we want from the dataframe. We can also select all the rows matching a particular criterion, for example, all the heroes:

In [119]:
my_df[my_df["class"] == "hero"]

  first_name   last_name       alias  class        city  country
0        Lax  Salmonsson  The Salmon   hero     Uppsala   Sweden
2       Luis      Gibert     Orceman   hero        Orce    Spain
3     Walter       Bruce    Batswede   hero  Gothamburg   Sweden
6      Banan  D'Artagnan   Musketeer   hero  Morecastle   France


And that's not all! We can even specify several conditions. For example, we might want to select all heroes in Sweden:

In [122]:
my_df[(my_df["class"] == "hero") & (my_df["country"] == "Sweden")]

  first_name   last_name       alias  class        city  country
0        Lax  Salmonsson  The Salmon   hero     Uppsala   Sweden
3     Walter       Bruce    Batswede   hero  Gothamburg   Sweden


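**Note!** If you want to combine several conditions, don't forget to put parentheses around every condition. The ``&`` operator binds more tightly than ``==``, so without the parentheses Python would evaluate ``"hero" & my_df["country"]`` first, which is not what we want.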
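Conditions can be combined in other ways too: ``|`` means "or" and ``~`` means "not". Here is a minimal sketch, assuming a small dataframe typed in by hand with a few rows and columns of our table, so that it runs without the file:

```python
import pandas as pd

# A few rows of the table typed in by hand, so no file is needed
my_df = pd.DataFrame({
    "alias": ["The Salmon", "Dr. Eggsalt", "Orceman", "Batswede"],
    "class": ["hero", "villain", "hero", "hero"],
    "country": ["Sweden", "Easterland", "Spain", "Sweden"],
})

# AND: heroes in Sweden
swedish_heroes = my_df[(my_df["class"] == "hero") & (my_df["country"] == "Sweden")]

# OR: anyone in Sweden or Spain
sweden_or_spain = my_df[(my_df["country"] == "Sweden") | (my_df["country"] == "Spain")]

# NOT: everyone who is not a hero
not_heroes = my_df[~(my_df["class"] == "hero")]

print(swedish_heroes["alias"].tolist())   # ['The Salmon', 'Batswede']
print(sweden_or_spain["alias"].tolist())  # ['The Salmon', 'Orceman', 'Batswede']
print(not_heroes["alias"].tolist())       # ['Dr. Eggsalt']
```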