## Data Structures Review

I have a list of filenames. The filename needs to be uppercased, while the extension must remain the same. How do I accomplish this? I don't want to change my previous piece of data, so I must create a new data-structure to store my new files.

Instead of just viewing the solution, let's try to build up to this.

First, let's view our data.

In [4]:
files = ["upper_case_me.txt", "data.csv", "datafile.csv", "importantdata.tsv"]

So we have a list of strings, and we're trying to construct a new list of upper-cased strings. So, right off the bat, we know that we need to create an empty list to prepare a space for our new data.

We can also go ahead and write a Python control-flow structure that will *loop* through each string of our `files` list.

This will come in the form of a for-loop, since this is the standard way of getting values one by one from a list.

For more info on loops, check out [this link](https://www.dataquest.io/blog/python-for-loop-tutorial/).

In [None]:
# prepare a new list
upper_list = []

for file in files:
    # putting "pass" here temporarily
    pass

OK. I have my for-loop, which gets each string of the list, but how do I now uppercase only the name of the file?

Well, let's think back to the methods we've previously used on a string to separate out a filename.

* There was the [split function](https://docs.python.org/3/library/stdtypes.html#str.split).
* There was also the option to use [find](https://docs.python.org/3/library/stdtypes.html#str.find).
* And lastly, we know we could also use the [pathlib module](https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.suffix).

Since someone has already written a module for us that properly handles filenames, I will choose to use `pathlib`.

If you use the alternative solutions, or find another alternative not yet listed, that will work just as well for today.
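
For comparison, here is a minimal sketch of the `split`-style alternative. It uses `str.rsplit` with a maximum of one split, so only the last dot is treated as the start of the extension. The function name `upper_name_with_split` is made up for this illustration.

```python
def upper_name_with_split(filename):
    """Uppercase everything before the last dot; leave the extension alone."""
    # split once, starting from the right, so "a.b.txt" -> ["a.b", "txt"]
    parts = filename.rsplit(".", 1)

    # no dot at all: there is no extension to preserve
    if len(parts) == 1:
        return filename.upper()

    name, ext = parts
    return name.upper() + "." + ext


print(upper_name_with_split("upper_case_me.txt"))
```

Edge cases such as names that start with a dot are handled differently by `pathlib`, which is one reason to prefer it.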

## Pathlib

Let's take a look at the pathlib [documentation](https://docs.python.org/3/library/pathlib.html#methods-and-properties). 

It looks like we actually have a pathlib property called [stem](https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.stem) which will extract the name of the file for us!

And, as we already know, we can get the suffix of the file by using `Path.suffix`.

In [2]:
from pathlib import Path

# let's test this out
test = Path("filename.txt")
print(test.stem)
print(test.suffix)

filename
.txt


Now that we've verified what to use, let's tie it into our loop.

We will take the `file` variable and split it into two values: the filename and the extension.

Since our goal is to create a new UPPERCASED filename, let's use the [upper()](https://docs.python.org/3/library/stdtypes.html#str.upper) method we learned about previously. Keep in mind that `upper()` returns a new string, and does not modify the previous string. Therefore, we must save it into a new variable.

In [None]:
# prepare a new list
upper_list = []

for file in files:
    # separate out name and suffix
    temp = Path(file)
    name = temp.stem
    ext = temp.suffix

    # capitalize name
    new_name = name.upper()

OK. We have our upper-cased string name, we have our suffix. How can we recombine them into one string again?

This points us to concatenation. 

ex:
```
x = "hello"
y = "world"
print(x + y)
```

Notice that we are not interacting with any new concepts! Just knowing about reading docs and some base string operations is enough to competently create a solution.

Let us concatenate our new_name and the extension into a new variable as well.

Lastly, we will append this new value back into `upper_list`.

In [6]:
# prepare a new list
upper_list = []

for file in files:
    # separate out name and suffix
    temp = Path(file)
    name = temp.stem
    ext = temp.suffix

    # capitalize name
    new_name = name.upper()
    
    # concatenate new name and extension
    newf = new_name + ext
    upper_list.append(newf)

upper_list

['UPPER_CASE_ME.txt', 'DATA.csv', 'DATAFILE.csv', 'IMPORTANTDATA.tsv']
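
Once the loop is clear, the same steps can also be written more compactly. The sketch below is one possible variant using a list comprehension. It assumes the same `files` list as above and builds a new list without modifying the original.

```python
from pathlib import Path

files = ["upper_case_me.txt", "data.csv", "datafile.csv", "importantdata.tsv"]

# for each filename: uppercase the stem, then glue the untouched suffix back on
upper_list = [Path(f).stem.upper() + Path(f).suffix for f in files]

print(upper_list)
print(files)  # the original list is unchanged
```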

## Summarize

Let's break down the steps we took:

1. Observe data, consider how we can access data.
2. Plan how we can use functions to get to our goal.
3. Choose one path based on ease and reliability.
4. Implement planned methods.

These are general steps that we can decompose into further steps. 

Now that we've created our solution, let's ask ourselves, should this be a function?

Well, we have to ask ourselves the following questions:

* Will I use this chunk of code again?
* Does this code do *one* thing?
* Is this code separable from the rest of my code? (Are its variables independent?)
* Would leaving this code outside of a function complicate the rest of my code?

Answering "yes" to any of these questions is generally a good enough reason to make this block of code into a function.

Since this is a review, we will be creating a function for this script that we've written. Or rather, we've been given another batch of data to change. How should we encapsulate this into a function?

In [None]:
def upper_files(lst):
    """Uppercase all file names in `lst` param

    Parameters
    ----------
    lst:    list()
        A list of filenames. Must contain extensions.

    Returns
    -------
    list()
        A list of uppercased filenames.
    """
    upper_list = []

    for file in lst:
        # separate out name and suffix
        temp = Path(file)
        name = temp.stem
        ext = temp.suffix

        # capitalize name
        new_name = name.upper()
        
        # concatenate new name and extension
        newf = new_name + ext
        upper_list.append(newf)

    return upper_list
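
Here is a short usage sketch showing the function applied to a second batch of data. It repeats a condensed version of the function so that the cell runs on its own. The list `second_batch` is invented for this example.

```python
from pathlib import Path


def upper_files(lst):
    """Condensed version of the function above, for a standalone demo."""
    upper_list = []
    for file in lst:  # loop over the parameter, not a global list
        temp = Path(file)
        upper_list.append(temp.stem.upper() + temp.suffix)
    return upper_list


# hypothetical second batch of filenames
second_batch = ["report.pdf", "notes.md"]
result = upper_files(second_batch)

print(result)
print(second_batch)  # the input list is left untouched
```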

## Scripting vs. OOP

So far, we've been using Python as a scripting language: writing step-by-step directions to get to some final state. This is good for linear and non-reusable applications of programming.

Ex: "Can you combine all these csv files into one based on columns?" "Can you quickly analyze this data?"

In [None]:
# Script approach

transactions = [500, 200, -300, 300, 100, 500, 600, 100, -1000]

tot = sum(transactions)
largest = max(transactions)
smallest = min(transactions)

print("sum of transactions", tot)
print("highest", largest)
print("smallest", smallest)

Python really shines when used as an object-oriented language. That is, a language that groups together operations and variables into discrete groups we call "objects."

Read more here: https://en.wikipedia.org/wiki/Object-oriented_programming
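
As a preview of where this is heading, the script above could be regrouped into an object. This is only a sketch: the class name `TransactionSummary` and its method names are invented for illustration, and the `class` syntax is explained later in this section.

```python
class TransactionSummary:
    """Bundle a list of transactions with the operations we run on it."""

    def __init__(self, transactions):
        self.transactions = transactions

    def total(self):
        return sum(self.transactions)

    def largest(self):
        return max(self.transactions)

    def smallest(self):
        return min(self.transactions)


summary = TransactionSummary([500, 200, -300, 300, 100, 500, 600, 100, -1000])

print("sum of transactions", summary.total())
print("highest", summary.largest())
print("smallest", summary.smallest())
```

The data and the operations that belong to it now travel together, so a second batch of transactions only needs a second `TransactionSummary` object.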

## The Philosophy of OOP

Think of how we were raised to view the world.

The world is a large system of objects that interact with one another. These objects have characteristics & these objects can do actions.

I have a pen. This pen is an object that has the following characteristics:

* it's blue (color)
* it has a 7mm ballpoint (size)

This pen can do the following actions:

* it can draw on a variety of surfaces
* I can take the pen cap off
* I can put the pen cap on (if I don't lose it)

We interact with objects on a daily basis without even thinking about them. Now we are creating our own objects in the metaphysical universe of our program. 

Our language even shapes this perspective. But I do not want to get lost in the ether of discussion, so let's get to programming.

## Builtin Objects

We've actually been interacting with objects all this time. Remember our strings? These are objects that have methods and variables attached to them.

Each of these objects has:
* a type
* an internal data representation (called instance variables)
* a set of procedures (called instance methods)

The documentation of strings will reveal the same. Here we see the list of methods that come attached to every single string object: https://docs.python.org/3/library/stdtypes.html#string-methods.

Since there aren't really any variables that we need to access on a string, we do not see instance variables in this case.

In [12]:
test = "hello"

# while we know a string is of type "string", we can also figure out that it is an object by looking at its doc
print(test.__doc__)

str(object='') -> str
str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to 'strict'.


In fact, everything in Python is an object. Every data type that we use in Python has some builtin methods that give it some additional functionality.

Below is a demonstration of how we can use our builtin methods associated with these objects.

In [17]:
# int object 
x = 5
print(x.bit_length())

# float object
y = 2.05
print(y.conjugate())

# Bool object
z = True
print(z.as_integer_ratio())

3
2.05
(1, 1)
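
One way to check the claim that everything is an object is to ask Python directly. The sketch below uses only the builtins `type()` and `isinstance()`; the sample values are arbitrary.

```python
values = [5, 2.05, True, "hello", [1, 2, 3], None, len]

for value in values:
    # type() reports the class an object was built from;
    # isinstance(value, object) is True for every Python value
    print(type(value).__name__, isinstance(value, object))
```

Even `None` and the builtin function `len` pass the check.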


## More Builtin Objects

Keep in mind, objects are just clumps of data & methods joined together for our own utility.

Since we are interested in data analysis, let's get to know a few core modules that will allow us to do the most basic of data-processing, reading & outputting.

For this we must utilize the objects that come with `csv` and `pathlib`.

We will go over these useful objects tomorrow when it comes to reading & writing data, but for now feel free to explore this documentation:

* https://docs.python.org/3/library/csv.html 
* https://docs.python.org/3/library/pathlib.html

## Primitives

In other programming languages, these "simple" data-types would be treated as something we call a primitive.

Think of it as a single piece of data that has no methods or variables attached to it. It is a singular and lonely piece of data that exists by itself in a program. This is not relevant for our discussions so we will ignore it for now.

We can see how primitives work here: https://www.geeksforgeeks.org/c-data-types/

## The Structure of an Object

Notice that there is a structure to calling a method from an object (a function that belongs to this object).

We do:

```
object_name.method()
```

This is the same for integers

```
x.bit_length()
```

For strings
```
test.upper()
```

And, later on, for objects that we create ourselves.

## Creating an object

To create an object that is not a builtin Python datatype, we almost always follow this format:

```
var_name = ClassName()
```

Specifically we do the following:

1. We name the object (just like we do for variables).
2. We set the object equal to the class name followed by parentheses. Usually there will be some argument passed into the parentheses as well, just like with functions.

We see this present in the following lines of code where we are creating builtin objects.

In [None]:
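from csv import reader
from pathlib import Path

# same pattern: name = callable(arguments)
# reader expects an iterable of lines (e.g. an open file), not a filename
spamreader = reader(["name,size", "testfile.csv,10"])
# name = class()
p = Path('.')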
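
Note that `csv.reader` expects an iterable of lines, such as an open file, rather than a filename. The sketch below uses `io.StringIO` as a stand-in for a real file so that it runs without anything on disk. The CSV contents are made up.

```python
import csv
import io
from pathlib import Path

# an in-memory text buffer that behaves like an open file
fake_file = io.StringIO("name,size\ndata.csv,10\n")

# name = callable(arguments)
spamreader = csv.reader(fake_file)
rows = list(spamreader)

# name = class(arguments)
p = Path("data.csv")

print(rows)
print(p.suffix)
```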

## Creating your own Class

Just like before, we won't always find ourselves using objects or functions that some other programmer wrote. Sometimes we will have to create our own objects for the sake of reusability.

Say you have just created a data pipeline that contains a whole bunch of methods used for processing your data.

Say you realize that some of these functions are used for one portion of the pipeline, and the others are used for a completely different part of your processing. Maybe it would be a good idea to separate these out into objects.

https://docs.python.org/3/tutorial/classes.html

## Create a Class

We start off with the `class` keyword. By convention, we always capitalize our class name.

```
class Name:
```

Next, we create a docstring. This will describe different parameters than our regular functions. We will go over the detail of this docstring later, as we want to focus on the implementation.

```
class Name:
    """
    """
```

We then implement something called the `__init__` method. This is our object [constructor](https://www.geeksforgeeks.org/constructors-in-python/), and it runs whenever a new object is created from the class definition.

We also use this `__init__` method to create parameters that will act as our `instance variables`.

Inside the parameter list we include `self`, which must come first and ties each method to the specific object it is called on. (`self` is a naming convention rather than a reserved keyword, but virtually all Python code uses it.) The rest of the `__init__` parameter list is what we will use to assign arguments to our instance variables.

```
class Name:
    """
    """
    def __init__(self):
        pass
```

## Real Example

Since this is going to get very abstract, let's use a real example.

We want to create an object that processes and transforms a list of filenames. We want to make this an object because we want to associate many methods to this operation, and we also want to save some easily accessible and named instance variables.

Feel free to notice the structure of the docstring as well. We will go over how this differs from a function docstring.

In [19]:
class FileTransformer:
    """Class to transform files properly

    Instance Variables
    ------------------
    files:  list()
            List that contains filenames.

    Public Methods
    ----------
    fix_names(self):
        Function to remove all non-alphabet characters from this list. Returns a new list.
    extension_count(self, extension):
        Function to count how many files have the given extension. Returns an int.
    """
    def __init__(self, files):
        self.files = files

Here we are creating a class called `FileTransformer`. We are preparing to pass in a list of files, as documented in the docstring and as specified in the `__init__` method. Notice how we place the parameter called `files` after `self`.

Furthermore, we create an instance variable called `files` and assign it the value of the parameter that is passed in.

How do we distinguish between the parameter `files` and the instance variable `files`? We bring back `self`: writing `self.files` tells Python that we are assigning this argument to the instance variable that belongs to this particular object.
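
To make the role of `self` concrete, the sketch below builds two objects from a trimmed-down copy of the class. Each object keeps its own `files` value. The filenames are invented for the example.

```python
class FileTransformer:
    def __init__(self, files):
        # left side: instance variable stored on this object
        # right side: the parameter passed to __init__
        self.files = files


first = FileTransformer(["a.txt"])
second = FileTransformer(["b.csv", "c.csv"])

print(first.files)
print(second.files)
```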

## Adding Functions

Now that we have our class definition, and our constructor called `__init__`, we can also add in `instance methods` that will interact with our `instance variables`.

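Note that we do not need to give a full docstring for each method, since the class docstring already describes the public ones. Also keep in mind that any variable or any method that we create inside this class belongs to this class and is only reachable through it (or through an object created from it).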

Think of this as an even more powerful function.

In [2]:
class FileTransformer:
    """Class to transform files properly

    Instance Variables
    ------------------
    files:  list()
            List that contains filenames.

    Public Methods
    ----------
    fix_names(self):
        Function to remove all non-alphabet characters from this list. Returns a new list.
    extension_count(self, extension):
        Function to count how many files have the given extension. Returns an int.
    """
    def __init__(self, files):
        self.files = files
    
    def fix_names(self):
        """A function to remove all non-alpha chars from self.files"""
        return
    
    def extension_count(self, extension):
        """A function to count files of a certain extension"""
        # make a new list that only contains file extensions
        extensions = []
        for f in self.files:
            dot_index = f.find(".")
            result = f[dot_index + 1:]
            extensions.append(result)
        
        # count how many times this extension appears
        return extensions.count(extension)
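
The `fix_names` method above is left as a stub. One possible way to fill it in is sketched below, under two assumptions that the docstring does not spell out: only the stem is cleaned, and the extension is kept as-is. It is written as a standalone function to keep the example short; inside the class, the loop would read from `self.files` instead.

```python
from pathlib import Path


def fix_names(files):
    """Return a new list with non-alphabetic characters removed from each stem."""
    fixed = []
    for f in files:
        path = Path(f)
        # keep only the characters for which str.isalpha() is True
        letters_only = "".join(ch for ch in path.stem if ch.isalpha())
        # assumption: the extension is preserved unchanged
        fixed.append(letters_only + path.suffix)
    return fixed


print(fix_names(["upper_case_me.txt", "data_2021.csv"]))
```

Note that `str.isalpha()` also accepts letters outside of a-z, such as accented characters.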



## Classes vs. Objects

A Python class is our blueprint for our Python object. Once we've implemented this class, we can then create an object.

We usually put classes in a separate file and sometimes even group them together with similar classes. We then import these classes via the `import` statement.

Keep in mind that the `object` is the actual bundle of data that you've created, while the `class` is the set of variables and definitions that you've specified in a separate file.

In [4]:
# here we are creating FileTransformer objects from the class we defined above

# empty object (no data)
empty = FileTransformer([])

# access object instance var (note we usually do not want to encourage this)
print(empty.files)
# run instance methods
print(empty.extension_count("py"))

# test object (some data)
test = FileTransformer(["hello.py", "goodbye.txt", "data.csv"])

# run instance methods
print(test.extension_count("py"))


[]
0
1


## Class DocString

Notice that we only wrote short, one-line docstrings for our methods. This is the norm, as the class docstring takes care of describing all instance variables and the purpose of each public method.

The format for a class docstring is as follows:

```
    """[class description]

    Instance Variables
    ------------------
    [variable]:  [type]
            [variable description]

    Public Methods
    ----------
    [method name]
        [method description]
    """
```

Just like with functions, this goes immediately underneath our definition.
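
A docstring written this way is stored on the class itself, so it can be read back with `__doc__` (or with the builtin `help()`), just like we did for the string earlier. The sketch below uses a shortened copy of the class.

```python
class FileTransformer:
    """Class to transform files properly

    Instance Variables
    ------------------
    files:  list()
            List that contains filenames.
    """
    def __init__(self, files):
        self.files = files


# the same text is available from the class and from any object built from it
print(FileTransformer.__doc__)
print(FileTransformer([]).__doc__ == FileTransformer.__doc__)
```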


## Overriding Functions

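We want to be able to print our objects and check them for equality, so we can also implement the following special methods to override Python's default behavior: `__str__` controls what `str()` and `print()` show, and `__eq__` controls how `==` compares two objects.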

In [None]:
class FileTransformer:
    """Class to transform files properly

    Instance Variables
    ------------------
    files:  list()
            List that contains filenames.

    Public Methods
    ----------
    fix_names(self):
        Function to remove all non-alphabet characters from this list. Returns a new list.
    extension_count(self, extension):
        Function to count how many files have the given extension. Returns an int.
    """
    def __init__(self, files):
        self.files = files
    
    def fix_names(self):
        """A function to remove all non-alpha chars from self.files"""
        return
    
    def extension_count(self, extension):
        """A function to count files of a certain extension"""
        # make a new list that only contains file extensions
        extensions = []
        for f in self.files:
            dot_index = f.find(".")
            result = f[dot_index + 1:]
            extensions.append(result)
        
        # count how many times this extension appears
        return extensions.count(extension)


    def __str__(self):
        return str(self.files)
    
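    def __eq__(self, other):
        # two FileTransformer objects are equal when their file lists are equal
        if not isinstance(other, FileTransformer):
            return NotImplemented
        return self.files == other.files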
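
The sketch below shows these two special methods in use on a trimmed-down copy of the class. It assumes that "equal" means "holds the same list of filenames", which is one possible interpretation. The sample filenames are invented.

```python
class FileTransformer:
    def __init__(self, files):
        self.files = files

    def __str__(self):
        return str(self.files)

    def __eq__(self, other):
        # assumption: equality means "same list of filenames"
        if not isinstance(other, FileTransformer):
            return NotImplemented
        return self.files == other.files


a = FileTransformer(["hello.py", "data.csv"])
b = FileTransformer(["hello.py", "data.csv"])
c = FileTransformer(["goodbye.txt"])

print(a)       # print() calls __str__
print(a == b)  # == calls __eq__
print(a == c)
```

Without `__eq__`, `a == b` would compare identities, so two separate objects would never be equal even when they hold the same data.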
