# Reading and Writing Files

This notebook contains a few exercises on reading and writing files.  It uses the same 311 data as the earlier exercises.

Relevant textbook section: [5.3 - Files](https://snakebear.science/05-StringsListsAndFiles/Files.html)

Each of the following steps builds on the previous one.  For each, copy your code from the previous step and modify or add to it to achieve the additional goals in the new step.

## 1) Open a file and find weird entries

Data often contains weird things.  In the 311 data, there are a lot of incomplete entries or data that wasn't entered correctly.  We had to write a fair amount of code just to filter those out to do an analysis.  A first step in dealing with that is *finding* the weird/messy data.

Here, we'll look for entries containing the string ``'XXXX'``.

1. Open the file `311_Service_Requests_-_Abandoned_Vehicles_-_No_Duplicates.csv`.
2. [Search through the file](https://snakebear.science/05-StringsListsAndFiles/Files.html#searching-through-a-file) using a for loop and a conditional to find any lines containing ``'XXXX'``.
    - Refer to [the final example](https://snakebear.science/05-StringsListsAndFiles/Files.html#files11) in the linked section.
    - Use ``line.find()``.  For example:
      ```
      if line.find('XXXX') != -1:
          # this body will execute if the line does contain 'XXXX'
      ```
3. For any lines containing ``'XXXX'``, add 1 to a counter variable ``weird_count``.
4. Print out the value of ``weird_count`` when done.  It should be 113.
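Before running the search on the full file, it can help to see what ``find()`` returns on a couple of lines. The sketch below uses made-up sample lines (not taken from the 311 data) and a hypothetical counter name, ``sample_weird_count``, to show that ``find()`` gives the position of the first match, or -1 when there is no match.

```python
# Made-up sample lines for illustration; these are NOT from the real 311 file.
sample_lines = [
    "2019-01-01,Completed,XXXX,Blue\n",
    "2019-01-02,Completed,ABC123,Red\n",
]

# str.find returns the index of the first match, or -1 if there is no match.
positions = [line.find('XXXX') for line in sample_lines]
print(positions)

# Same counting pattern as in the steps above, applied to the sample lines.
sample_weird_count = 0
for line in sample_lines:
    if line.find('XXXX') != -1:
        sample_weird_count = sample_weird_count + 1

print(sample_weird_count, "weird lines found in the sample.")
```

The test ``'XXXX' in line`` gives the same yes/no answer as ``line.find('XXXX') != -1``; this exercise uses ``find()`` to match the textbook.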

In [1]:
weird_count = 0  # count of weird lines

with open('311_Service_Requests_-_Abandoned_Vehicles_-_No_Duplicates.csv') as file:
    for line in file:
        if line.find('XXXX') != -1:
            weird_count = weird_count + 1

print(weird_count, "weird lines found.")

113 weird lines found.


## 2) Store weird entries in their own file

Now that you've found the weird entries, it might be useful to store them in their own file for further inspection and processing.

Copy your code from the previous part, and modify it to:

1. [Open a new file for writing](https://snakebear.science/05-StringsListsAndFiles/Files.html#writing-files) called ``311_weird.csv``.
   - To do this you might need to open two files simultaneously, because you will also need to read from the original data file.
2. For any 'weird' line found (as in Part 1), write that line into ``311_weird.csv``.

Open the new file (double-click on it in the file browser on the left) to check that it contains 113 lines, all containing ``'XXXX'`` somewhere.  (It might look like 112 lines, but JupyterHub's viewer treats the first line as a header, so that line doesn't get numbered.)
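Opening two files at once can be done with nested ``with`` blocks or with a single ``with`` statement that lists both files. The sketch below shows the single-statement form on small stand-in files; the names ``sample_in.csv`` and ``sample_weird.csv`` and their contents are invented for illustration, and they live in a temporary folder so the real data is untouched.

```python
import os
import tempfile

# Hypothetical stand-in files in a temporary folder, so this runs on its own.
folder = tempfile.mkdtemp()
in_path = os.path.join(folder, 'sample_in.csv')
out_path = os.path.join(folder, 'sample_weird.csv')

# Create a tiny input file with two 'weird' lines and one normal line.
with open(in_path, 'w') as f:
    f.write("a,XXXX,c\n")
    f.write("d,e,f\n")
    f.write("XXXX,h,i\n")

# One with statement opens both files; both are closed when the block ends.
with open(in_path) as file_in, open(out_path, 'w') as file_out:
    for line in file_in:
        if line.find('XXXX') != -1:
            file_out.write(line)

# Read the output back to check what was written.
with open(out_path) as f:
    weird_lines = f.readlines()

print(len(weird_lines), "lines written to the sample output file.")
```

Reading the output file back in code, as in the last few lines, is one way to check the line count without relying on the file viewer.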

In [2]:
with open('311_weird.csv', 'w') as file_out:
    with open('311_Service_Requests_-_Abandoned_Vehicles_-_No_Duplicates.csv') as file:
        for line in file:
            if line.find('XXXX') != -1:
                file_out.write(line)

## 3) Write a "cleaned" version of the file

It is also common to want to produce a "clean" version of some data, with entries you don't want to use removed.  It is important to store this in a **new** file, so that the original file is preserved.  Preserving the original data allows you or anyone else to **reproduce** your work by re-running your code on the original data.

Copy your code from the previous part, and modify it to:

1. Open a new file for writing called ``311_clean.csv``.  (This code doesn't need to open ``311_weird.csv`` any more.)
2. As you loop through the lines of the input file, use ``continue`` as shown in the textbook to **skip** any 'weird' lines, where 'weird' here means any of the following:
   1. The line contains ``'XXXX'``
   2. The line, when split into a list on the comma character, has fewer than 17 entries.  (Most lines contain 17 values.  A few do not.)
      - To split the line, use:
      ```
            entries = line.split(',')
      ```
   3. The entry at index 10 in ``entries`` is not numeric.  (Index 10 should be a number of days.  If it doesn't hold a number, then we'll treat it as invalid data.)
      - Write a conditional using:
      ```
            not entries[10].isnumeric()
      ```
3. For any lines *not* skipped, write them into the ``311_clean.csv`` file.
4. Count the number of lines written into ``311_clean.csv`` and print out the count.

**Tip:** Start with just one step of filtering, get it working, and then add the next two, one at a time.

The final count should be 146383.
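To see how the second and third checks behave, it can help to try ``split(',')`` and ``isnumeric()`` on a few small strings first. The sketch below uses invented values, not rows from the 311 file. It shows that the last entry of a split line keeps its newline, and that ``isnumeric()`` is false for empty strings, negative numbers, decimals, and strings with spaces.

```python
# Invented example line; the real file has 17 values per line, this has 3.
line = "a,b,c\n"
entries = line.split(',')
print(entries)       # note that the last entry still ends with the newline
print(len(entries))

# isnumeric() is True only for non-empty strings made up entirely of
# numeric characters, so signs, decimal points, and spaces make it False.
checks = {}
for text in ['12', '', '-5', '3.5', '12 ']:
    checks[text] = text.isnumeric()

print(checks)
```

Note that ``split(',')`` splits on every comma, including any inside quoted values; this exercise treats that as acceptable for the 311 data.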

In [3]:
good_count = 0   # number of good lines written into 311_clean.csv

with open('311_clean.csv', 'w') as file_out:
    with open('311_Service_Requests_-_Abandoned_Vehicles_-_No_Duplicates.csv') as file:
        for line in file:
            # Skip lines with 'XXXX'
            if line.find('XXXX') != -1:
                continue
            
            entries = line.split(',')
            # Skip lines with < 17 values
            if len(entries) < 17:
                continue
            
            # Skip lines where the value at index 10 is not numeric
            if not entries[10].isnumeric():
                continue
            
            # If we get here, none of the checks above were triggered and the line is "clean"
            file_out.write(line)
            good_count = good_count + 1

print(good_count, "lines written to 311_clean.csv.")

146383 lines written to 311_clean.csv.


***

And that's it for this exercise.  With the cleaned data, you could write an analysis program with fewer checks for bad data, making it cleaner and easier to understand.  Separating out the cleaning from the analysis makes each part easier to work with and more reusable.
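As a rough illustration of that last point, here is a minimal sketch of an analysis loop that averages the value at index 10. It assumes every row has 17 values and plain digits at index 10, which is what the cleaning step is meant to provide. The rows are generated by a hypothetical helper, ``make_row``, and read from an in-memory ``io.StringIO`` object rather than from ``311_clean.csv``, so the numbers are made up.

```python
import io

def make_row(days):
    """Build a made-up row with 17 values and a number of days at index 10."""
    values = ['x'] * 17
    values[10] = str(days)
    return ','.join(values) + '\n'

# Stand-in for an opened cleaned file; the contents are invented.
clean_file = io.StringIO(make_row(3) + make_row(10) + make_row(20))

total_days = 0
row_count = 0
for line in clean_file:
    entries = line.split(',')
    # Assumes the data is already cleaned, so there are no checks here.
    total_days = total_days + int(entries[10])
    row_count = row_count + 1

average_days = total_days / row_count
print("Average days:", average_days)
```

With a real cleaned file, ``clean_file`` would instead come from ``open('311_clean.csv')`` inside a ``with`` statement.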