# DATA-531 Lecture 4

#### (From previous class) Can lambda return 2 values?

- ```lambda x: (x+1, x+2)``` --> tuple
- ```lambda x: [x+1, x+2]``` --> list
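- ```lambda x: x+1, x+2``` --> Error if `x` is not defined outside the lambda: the comma ends the lambda, so this is parsed as the tuple `(lambda x: x+1, x+2)`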
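
To see how Python reads the third form, here is a small sketch. The names `pair`, `not_a_pair`, and the outer variable `x` are made up for illustration.

```python
# A lambda can return two values by wrapping them in a tuple.
pair = lambda x: (x + 1, x + 2)
a, b = pair(10)  # tuple unpacking: a is 11, b is 12

# Without parentheses, the comma ends the lambda. The result is a 2-tuple:
# (the function "lambda x: x + 1", the value of x + 2 using an *outer* x).
x = 100
not_a_pair = lambda x: x + 1, x + 2

print(type(not_a_pair))   # <class 'tuple'>
print(not_a_pair[1])      # 102, computed from the outer x
print(not_a_pair[0](1))   # 2, the lambda itself only computes x + 1
```

If no outer `x` exists, evaluating `x + 2` raises a `NameError`.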

Outline:

- Python classes
- Python `import`
- Importing your own functions  
- Intriguing behaviour in Python 
- References 
- Function calls and references 
- `copy` and `deepcopy` 
- Scoping 
- Pandas

In [3]:
import numpy as np

## Python Classes (20 min)

- We've seen data types like `dict` (built in to Python) and `np.ndarray` (3rd party library). 
- Today we'll see how to create our own data types. 
- These are called **classes** and an instance is called an **object**. (Classes documentation [here](https://docs.python.org/3/tutorial/classes.html).)
- For our purposes, a type and a class are the same thing. Some discussion of the differences [here](https://stackoverflow.com/questions/468145/what-is-the-difference-between-type-and-class).
- The general approach to programming using classes and objects is called [object-oriented programming](https://en.wikipedia.org/wiki/Object-oriented_programming).

In [4]:
d = dict()

Here, `d` is an object, whereas `dict` is a type. 

In [5]:
type(d)

dict

In [6]:
type(dict)

type

We say `d` is an **instance** of type `dict`. Hence

In [7]:
isinstance(d, dict)

True

#### Why create your own types/classes?

- Example: a circle in 2D space
- You want to be able to _change_ the circle in several ways: move it or make it bigger or smaller.
- You want to be able to compute properties of the circle: its area, circumference, and its distance to the origin.

In [8]:
x = 2.0
y = 3.0
r = 1.0 # radius

def area(r):
    """Compute the area of a circle with radius r."""
    return np.pi * r**2

def circumference(r):
    """Compute the circumference of a circle with radius r."""
    return 2.0 * np.pi * r

def dist(x, y, r):
    """Compute the distance to the origin from a circle with centre x, y and radius r."""
    return np.abs((np.sqrt(x**2 + y**2) - r))

In [9]:
dist(x, y, r)

2.605551275463989

In [10]:
area(r)

3.141592653589793

Now let's say you want two circles...

This approach is very clunky. What if you accidentally call

In [11]:
x2= 3.0
y2= 4.0
r2= 2.0

dist(x2, y2, r) # use the radius of the other circle by accident

4.0

Ok, so maybe you can wrap everything in dictionaries:

In [13]:
circle1 = {"x" : x,
           "y" : y,
           "r" : r}

circle2 = {"x" : x2,
           "y" : y2,
           "r" : r2}

#dist(circle1["x"], circle1["y"], circle1["r"])
dist(**circle1)# fancy syntax to "unpack" a dictionary into the arguments of a function

2.605551275463989

The above is slightly better, but still awkward. For example, you might accidentally do

In [None]:
circle3 = {"x" : 5,
           "z" : 2,  # now circle3 has different property names by accident
           "r" : 3}

In [None]:
dist(**circle3)

- Classes allow us to enforce the _structure of our data_.
  - That is, a circle contains an $x$, a $y$, and an $r$.
- It also helps writing functions, as you'll see.
  - Above, all our functions had to take in the same data and re-explain the arguments.
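
With dictionaries, the misnamed key is only noticed when a function is finally called. The sketch below reproduces that failure; it uses `math` instead of numpy only to stay self-contained, and `error_message` is a name introduced for illustration.

```python
import math

def dist(x, y, r):
    """Distance to the origin from a circle with centre (x, y) and radius r."""
    return abs(math.sqrt(x**2 + y**2) - r)

circle3 = {"x": 5, "z": 2, "r": 3}  # "z" instead of "y" by accident

try:
    dist(**circle3)  # the keys no longer match the argument names
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
```

A class moves this check to the moment the object is created, because `__init__` fixes which fields every instance has.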

#### Making a class

- The syntax below creates a class, or type, called `Circle`. 
- The functions defined inside a class are called **methods**.
- The `__init__` method is run when you create a new instance of the class (i.e. a new `Circle` object).

In [14]:
class Circle:
    """A circle with a centre (x,y) and radius r."""
    
    def __init__(self, x, y, r):
        self.x = x
        self.y = y
        self.r = r

Let's re-create `circle1`:

In [15]:
circle1 = Circle(2.0, 3.0, 1.0)

In [16]:
type(circle1)

__main__.Circle

In [17]:
circle1.x # retrieve one of the fields

2.0

Let's now implement the methods:

In [19]:
class Circle:
    """A circle with a centre (x,y) and radius r."""
    
    def __init__(self, x, y, r=1.0):
        # For those familiar with a "constructor" - this is it!
        self.x = x
        self.y = y
        self.r = r
        
    def area(self):
        """Compute the area of a circle with radius r."""
        return np.pi * self.r**2

    def circumference(self):
        """Compute the circumference of a circle with radius r."""
        return 2.0 * np.pi * self.r

    def dist(self):
        """Compute the distance to the origin."""
        return np.abs(np.sqrt(self.x**2 + self.y**2) - self.r)

Some things to note:

- The inputs to the methods are just `self`. 
- This `self` object is literally itself; thus, it gives you access to all the data inside the class using `self.x`, etc. 
- No need to re-explain the arguments each time, just explain the data at the start of the class.
  - This makes the code cleaner, more reusable and more modular.
- We call the functions with the `.`
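
One way to see what `self` is: calling a method on an instance is equivalent to calling the function on the class and passing the instance explicitly. A minimal sketch, using `math` instead of numpy so it runs on its own:

```python
import math

class Circle:
    """A circle with a centre (x, y) and radius r."""

    def __init__(self, x, y, r=1.0):
        self.x = x
        self.y = y
        self.r = r

    def area(self):
        """Compute the area of the circle."""
        return math.pi * self.r**2

c = Circle(2.0, 3.0, 1.0)

# Calling the method on the instance...
via_instance = c.area()
# ...is the same as calling it on the class and passing the instance as self.
via_class = Circle.area(c)

print(via_instance == via_class)  # True
```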

In [20]:
circle1 = Circle(2.0, 3.0, 1.0)

In [21]:
circle1.area()

3.141592653589793

In [22]:
circle1.dist()

2.605551275463989

In fact, we've seen this before:

In [23]:
d = dict()

for key, val in d.items():
    pass

This is the same `.` because `items` is a method of the `dict` class.

In [24]:
a = np.random.randint(10, size=8) # make a numpy array
a

array([8, 9, 8, 3, 9, 1, 2, 6])

In [25]:
# shape is an attribute (field) of the ndarray object
a.shape 

(8,)

In [None]:
# size is an attribute (field) of the ndarray object
a.size

These are fields of the `ndarray` object. Here is a method:

In [None]:
a.sort()
a

- Now imagine we also wanted a function to compute the distance between two circles.
- This would have been a pain before:

In [None]:
def dist_between(x1, y1, r1, x2, y2, r2):
    """
    Compute the distance between one circle and another circle.
    
    Arguments:
    x1 -- (float) x-coordinate of the centre of the first circle
    y1 -- (float) y-coordinate of the centre of the first circle
    r1 -- (float) radius of the first circle
    x2 -- (float) x-coordinate of the centre of the second circle
    y2 -- (float) y-coordinate of the centre of the second circle
    r2 -- (float) radius of the second circle
    """
    return np.sqrt((x1 - x2)**2 + (y1 - y2)**2) - (r1 + r2)

dist_between(x, y, r, x2, y2, r2)

In [26]:
class Circle:
    """A circle with a centre (x,y) and radius r."""
    
    def __init__(self, x, y, r):
        self.x = x
        self.y = y
        self.r = r
        
    def area(self):
        """Compute the area of a circle with radius r."""
        return np.pi * self.r**2

    def circumference(self):
        """Compute the circumference of a circle with radius r."""
        return 2.0 * np.pi * self.r

    def dist(self):
        """Compute the distance to the origin."""
        return np.abs(np.sqrt(self.x**2 + self.y**2) - self.r)
    
    def dist_between(self, other):
        """
        Compute the distance between this circle and another circle.
        
        Parameters
        ----------
        other : Circle
            the other circle.
        """
        if not isinstance(other, Circle):
            raise Exception("other must be a Circle!!!")
        
        return np.sqrt((self.x - other.x)**2 + (self.y - other.y)**2) - (self.r+ other.r)

In [27]:
circle1 = Circle(2.0, 3.0, 1.0)
circle2 = Circle(8,9,0.1)

In [28]:
circle2.dist_between(circle1)

7.38528137423857

#### Changing data in a class

- Classes you create are generally mutable.
- You can directly change the data like this:

In [None]:
circle1.circumference()

In [39]:
circle1.r = 10
circle1.circumference()

62.83185307179586

You can also create methods that allow the user to change the object:

In [40]:
class Circle:
    """A circle with a centre (x,y) and radius r."""
    
    def __init__(self, x, y, r):
        self.x = x
        self.y = y
        self.r = r
        
    def area(self):
        """Compute the area of a circle with radius r."""
        return np.pi * self.r**2

    def circumference(self):
        """Compute the circumference of a circle with radius r."""
        return 2.0 * np.pi * self.r

    def dist(self):
        """Compute the distance to the origin."""
        return np.abs(np.sqrt(self.x**2 + self.y**2) - self.r)
    
    def dist_between(self, other):
        """
        Compute the distance between this circle and another circle.
        
        Parameters
        ----------
        other : Circle
            the other circle.
        """
        if not isinstance(other, Circle):
            raise Exception("other must be a Circle!!!")

        return np.sqrt((self.x - other.x)**2 + (self.y - other.y)**2) - (self.r + other.r)
                 
    def translate(self, Δx, Δy):
        """Move the circle by (Δx, Δy)"""
        self.x += Δx
        self.y += Δy
        return self # This is not needed, but is sometimes convenient (method cascading).

In [30]:
circle1 = Circle(2.0, 3.0, 1.0)

In [31]:
circle1.dist()

2.605551275463989

In [32]:
circle1.translate(10, 10)
circle1.dist()

16.69180601295413
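
Because `translate` returns `self`, calls can be chained (method cascading). A minimal sketch with a stripped-down `Circle` that keeps only what the example needs (plain `dx`, `dy` are used here instead of `Δx`, `Δy`):

```python
class Circle:
    """A circle with a centre (x, y) and radius r."""

    def __init__(self, x, y, r):
        self.x = x
        self.y = y
        self.r = r

    def translate(self, dx, dy):
        """Move the circle by (dx, dy) and return it so calls can be chained."""
        self.x += dx
        self.y += dy
        return self

c = Circle(2.0, 3.0, 1.0)
c.translate(10, 10).translate(-2, 0)  # two moves in one expression
print(c.x, c.y)  # 10.0 13.0
```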

#### Other special methods

- Aside from `__init__`, there are other special methods you might find useful.
- For example, what if we want to print our object?

In [41]:
print(circle1)

<__main__.Circle object at 0x...>


- This doesn't look very good.
- But other objects, like numpy arrays, print out nicely:

In [None]:
print(a)

- To specify how our object is printed, we can define a method called `__str__` ([Python documentation](https://docs.python.org/3/reference/datamodel.html#object.__str__)).

In [47]:
class Circle:
    """A circle with a centre (x,y) and radius r."""
    
    def __init__(self, x, y, r):
        self.x = x
        self.y = y
        self.r = r
        
    def area(self):
        """Compute the area of a circle with radius r."""
        return np.pi * self.r**2

    def circumference(self):
        """Compute the circumference of a circle with radius r."""
        return 2.0 * np.pi * self.r

    def dist(self):
        """Compute the distance to the origin."""
        return np.abs(np.sqrt(self.x**2 + self.y**2) - self.r)
    
    def dist_between(self, other):
        """
        Compute the distance between this circle and another circle.
        
        Parameters
        ----------
        other : Circle
            the other circle.
        """
        if not isinstance(other, Circle):
            raise Exception("other must be a Circle!!!")

        return np.sqrt((self.x - other.x)**2 + (self.y - other.y)**2) - (self.r + other.r)
    
    def translate(self, Δx, Δy):
        """Move the circle by (Δx, Δy)"""
        self.x += Δx
        self.y += Δy
        return self # This is not needed, but is sometimes convenient.
        
    def __str__(self):
        return "A Circle at (%.1f, %.1f) with radius %.1f." % (self.x, self.y, self.r)

In [48]:
circle1 = Circle(2.0, 3.0, 1.0)

In [49]:
print(circle1)

A Circle at (2.0, 3.0) with radius 1.0.
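
A closely related special method is `__repr__`, which Python uses when an object is displayed without `print` (for example as the last line of a cell, or inside a list). The sketch below is one possible choice of format, not something the class above defines:

```python
class Circle:
    """A circle with a centre (x, y) and radius r."""

    def __init__(self, x, y, r):
        self.x = x
        self.y = y
        self.r = r

    def __str__(self):
        # Human-friendly text, used by print() and str()
        return "A Circle at (%.1f, %.1f) with radius %.1f." % (self.x, self.y, self.r)

    def __repr__(self):
        # Unambiguous text, used by repr() and when shown inside containers
        return "Circle(%r, %r, %r)" % (self.x, self.y, self.r)

c = Circle(2.0, 3.0, 1.0)
print(str(c))    # A Circle at (2.0, 3.0) with radius 1.0.
print(repr(c))   # Circle(2.0, 3.0, 1.0)
print([c])       # [Circle(2.0, 3.0, 1.0)]
```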


## Python `import` (10 min)

- It is often useful to collect a bunch of classes and functions into **modules** or **packages** ([Python package documentation](https://docs.python.org/3/tutorial/modules.html#packages)).
  - For example, numpy is a package that contains both classes (e.g. `np.ndarray`) and functions (e.g. `np.sqrt`) and even constants (e.g. `np.pi`).
- You will discuss packages in depth in other courses.
- For now, we'll just discuss importing packages.
- Unfortunately, this is a bit confusing.

#### Ways of importing things

Let's use `numpy` as an example, and import it in various ways.


Import a package:

In [None]:
import numpy

In [None]:
# using the function sqrt that the package numpy is providing
# sqrt is not a method of a particular class because numpy is actually a package
numpy.sqrt(5) 

Import a package, but refer to it by a different name:

In [None]:
import numpy as np

In [None]:
np.sqrt(5)

Import a particular function from a package:

In [None]:
from numpy.random import randn

In [None]:
randn() # now I can refer to it without the package/module names
# np.random.randn() was the code before

In [None]:
from numpy.random import randn as random_gaussian

In [None]:
random_gaussian()

It's also possible to import everything in a module, though this is generally not recommended:

In [None]:
from numpy.random import *

In [None]:
binomial(10, 0.1) #instead of numpy.random.binomial(10, 0.1)
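
One reason star imports are discouraged is that they bring in many names at once, so a later import can silently replace an earlier name. The sketch below makes the clash visible with two explicit imports from the standard library (`math` and `cmath` are stand-ins chosen for illustration):

```python
from math import sqrt
real_result = sqrt(4)
print(real_result)      # 2.0

# A later import of the same name silently replaces the earlier one.
from cmath import sqrt
complex_result = sqrt(4)
print(complex_result)   # (2+0j)

# cmath.sqrt accepts negative numbers; math.sqrt would raise a ValueError here.
result = sqrt(-1)
print(result)           # 1j
```

With `from module import *` the same replacement happens, but without any line that names `sqrt`, which makes it much harder to track down.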

#### Some annoying facts of life

The module and the function might have the same name:

In [None]:
import random

In [None]:
random.random()

In [None]:
from random import random

In [None]:
random()

Sometimes you may need to explicitly import submodules to use them:

In [None]:
import scipy

In [None]:
scipy.stats

In [None]:
import scipy.stats

In [None]:
scipy.stats

In Python, the import name and the install name do not necessarily match:

In [None]:
import sklearn

To install, run `pip install scikit-learn` or `conda install -c anaconda scikit-learn`.

#### `dir`

You can use `dir` to list all the attributes and methods of an object:

In [50]:
dir(circle1)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'area',
 'circumference',
 'dist',
 'dist_between',
 'r',
 'translate',
 'x',
 'y']
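
Most of that list is special (double-underscore) names that every object has. A small sketch for filtering them out, using a cut-down `Circle` defined only for this example:

```python
class Circle:
    """A circle with a centre (x, y) and radius r."""

    def __init__(self, x, y, r):
        self.x = x
        self.y = y
        self.r = r

    def area(self):
        return 3.141592653589793 * self.r**2

c = Circle(1, 2, 3)

# Keep only the names that do not start with an underscore.
public_names = [name for name in dir(c) if not name.startswith("_")]
print(public_names)  # ['area', 'r', 'x', 'y']
```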

## Importing your own functions (5 min)

- In many MDS courses we only work in Jupyter - it is a great teaching & learning environment.
- However, when we write larger pieces of code we will need to move to `.py` files. 
- Let's restart the kernel so that `Circle` is no longer in the environment.

In [8]:
circle = Circle(1,2,3)

- Luckily, I have a file in this directory named `circle.py` - let's take a look.

In [7]:
from circle import Circle

In [10]:
c = Circle(1,2,3)

In [4]:
my_function()

NameError: name 'my_function' is not defined

In [11]:
from circle import *

In [12]:
my_function()

In [13]:
MY_CONSTANT

5

- We imported not only a class, but also a function and a single variable.
- It makes sense that we can import all of these, because they are all objects in Python, just with different types:

In [14]:
type(Circle)

type

In [15]:
type(my_function)

function

In [16]:
type(MY_CONSTANT)

int

And `c` itself has a type that we defined:

In [None]:
type(c)
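
The contents of `circle.py` are not reproduced in these notes. A file consistent with the cells above might look like the following sketch; the body of `my_function` and the methods of `Circle` are assumptions, and only the names and the value `MY_CONSTANT = 5` come from the outputs above.

```python
# circle.py (hypothetical contents, for illustration only)
import math

MY_CONSTANT = 5  # matches the value displayed above


def my_function():
    """Placeholder: the real body of my_function is not shown in these notes."""
    print("Hello from circle.py!")


class Circle:
    """A circle with a centre (x, y) and radius r."""

    def __init__(self, x, y, r):
        self.x = x
        self.y = y
        self.r = r

    def area(self):
        """Compute the area of the circle."""
        return math.pi * self.r**2
```

With this file in the same directory as the notebook, `from circle import Circle` makes the class available after a kernel restart.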

In [2]:
import numpy as np

In [20]:
#What is the value of y?
x = 1
y = x
x = 2
y
print(id(x))
print(id(y))

140392900362512
140392900362480


And how about the next one?

In [22]:
x = [1]
y = x
x[0] = 2
y
print(id(x))
print(id(y))

[2]
[2]
140392554552704
140392554552704


## References


- In Python, a list `x` is a **reference** to some location in the computer's memory. 
- If you set `y = x` these two variables now refer to the same location in memory - the one that `x` referred to.
- Setting `x[0] = 2` goes and modifies that memory. So `x` and `y` are both modified.


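- Basic built-in types such as `int`, `float`, and `bool` can look like _exceptions_ to this logic, but they follow the same rules:
  - Setting `y = x` makes both names refer to the same object `1`; nothing is copied.
  - These types are immutable, so there is no way to modify that object in place. `x = 2` rebinds `x` to a different object and leaves `y` alone, so `x` and `y` behave as if they were decoupled.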
  
- Analogy:
  - I share a Dropbox folder (or git repo) with you, and you modify it -- I sent you _the location of the stuff_ (this is like the list case)
  - I send you an email with a file attached, you download it and modify the file -- I sent you _the stuff itself_ (this is like the integer case)
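
The `is` operator checks whether two names refer to the same object, which makes the difference between modifying and rebinding easy to test. A short sketch (the variable names `same_before`, `n`, and `m` are just for illustration):

```python
x = [1]
y = x                  # y refers to the same list object as x
same_before = y is x
print(same_before)     # True

x[0] = 2               # modify the shared list in place
print(y)               # [2]

x = [2]                # rebind x to a brand-new list
print(y is x)          # False: equal contents, different objects
print(y == x)          # True

n = 1
m = n
n = 2                  # rebinding n does not affect m
print(m)               # 1
```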



And this?

In [1]:
x = [1]
y = x
x = [2] # before we had x[0] = 2
y

[1]

<br><br><br>
No, here we are not modifying the contents of `x`, we are setting `x` to refer to a new list `[2]`.

#### Additional weirdness

In [3]:
x = np.array([1,2,3,4,5])
y = x
x = x + 5
y

array([1, 2, 3, 4, 5])

In [4]:
x = np.array([1,2,3,4,5])
y = x
x += 5
y

array([ 6,  7,  8,  9, 10])

So, it turns out `x += 5` is not identical to `x = x + 5`.

- The former modifies the contents of `x`.
- The latter first evaluates `x + 5` to a new array of the same size, and then overwrites the name `x` with a reference to this new array.
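
The same distinction shows up with plain Python lists, which makes it easy to try without numpy. A minimal sketch (the names `y_after_plus` and `y_after_iadd` are only there to record the results):

```python
x = [1, 2, 3]
y = x
x = x + [4]        # builds a new list and rebinds x; y is untouched
y_after_plus = list(y)
print(y)           # [1, 2, 3]

x = [1, 2, 3]
y = x
x += [4]           # extends the existing list in place; y sees the change
y_after_iadd = list(y)
print(y)           # [1, 2, 3, 4]
print(x is y)      # True
```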

## Function calls and references

How about these?

In [8]:
def foo(y):
    y = "Hello from inside foo!"
    return y

x = "I'm outside."
foo(x)
x

"I'm outside."

In [6]:
def bar(y):
    y[0] = "Hello from inside foo!"
x = ["I'm outside."]
bar(x)
x

['Hello from inside foo!']

- Above: the fact that you called a function is not relevant.
- When you pass `x` into the function and it becomes `y` inside the function, that is basically like the `y = x` we had above.
- In the latter case, we say the function has a [side effect](https://en.wikipedia.org/wiki/Side_effect_(computer_science)).

In [9]:
x = "I'm outside."
x = foo(x)
x

'Hello from inside foo!'

- Above: in this case, `x` is not getting modified inside `foo`.
- Rather it's getting overwritten after the function call.

- (Optional) If you're interested, there is a bunch of terminology you can look up
  - pass by value (call by value)
  - pass by reference (call by reference)
  - copy-on-modify
  - lazy copying
  - ...


- Good news: we don't need to memorize special rules for calling functions.
- Arguments are always passed as references. Immutable types such as `int`, `float`, `bool`, `str`, and `tuple` cannot be changed in place, so they behave as if they were copied; mutable objects can be modified by the function.
- Now you see why we care whether objects are mutable or immutable... passing around a reference to a mutable object can be dangerous!
- **General rule**: if you do `x = ...` then you're not modifying the original, but if you do `x.SOMETHING = y` or `x[SOMETHING] = y` or `x *= y` then you probably are.

Note: In R, life is simpler: functions work on copies (copy-on-modify), so you're generally not "modifying the original" inside a function.

## `copy` and `deepcopy` 

In [2]:
import copy

x = [1]
y = x
x[0] = 2
y

[2]

In [12]:
x = [1]
y = copy.copy(x)
x[0] = 2
y

[1]

Ok, so what do you think will happen here?

In [3]:
x = [[1], [2,99], [3, "hi"]] # a list of lists

y = copy.copy(x) 

x[0][0] = "pikachu"
print(x)
print(y)

[['pikachu'], [2, 99], [3, 'hi']]
[['pikachu'], [2, 99], [3, 'hi']]


<br><br><br>
What happened? 

- `copy` makes a new _container_, i.e. a new outer list.
- But both outer lists point to the same inner lists.
- This is what happens after `y = copy.copy(x)`:

![](listCopySmall.jpg)

We can use `is` to tell apart these scenarios.

In [4]:
x == y       # they are both lists of the same lists

True

In [10]:
x is y   # but they are not the *same* lists of that stuff

False

So, by that logic...

In [42]:
y.append(5)
print(x)
print(y)

[['pikachu'], [2, 99], [3, 'hi']]
[['pikachu'], [2, 99], [3, 'hi'], 5]


In [43]:
x == y

False

<br><br><br>
That makes sense, as weird as it seems. 

- In short, `copy` only copies the top level (a "shallow" copy).
- What if we want to copy everything?
- Enter our friend `deepcopy`:

In [44]:
x = [[1], [2,99], [3, "hi"]] 

y = copy.deepcopy(x)

x[0][0] = "pikachu"
print(x)
print(y)

[['pikachu'], [2, 99], [3, 'hi']]
[[1], [2, 99], [3, 'hi']]
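
The difference between the two kinds of copy can be checked directly with `is` on the inner lists. A short sketch (the names `shallow` and `deep` are chosen for illustration):

```python
import copy

x = [[1], [2, 99], [3, "hi"]]
shallow = copy.copy(x)
deep = copy.deepcopy(x)

print(shallow is x)         # False: the outer list is new
print(shallow[0] is x[0])   # True: the inner lists are shared
print(deep[0] is x[0])      # False: the inner lists were copied too
print(deep == x)            # True: the contents are still equal
```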


## Scoping


In [45]:
def f():
    x = 10

x = 5
f()

In [None]:
def f():
    new_variable = 10

f()
new_variable

- It looks like the `x` inside and outside the function are different.
- It looks like `new_variable` is defined only for use inside the function.
- That is generally a good way of thinking, and is more true in other languages.
- This is called **scope** (see [Wikipedia article](https://en.wikipedia.org/wiki/Scope_(computer_science))).
- However, in Python things are dangerously loose and permissive, so **be careful**.
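
The first two observations can be verified directly. In this sketch the functions return or test values so the result is visible; `g` and `exists_outside` are names introduced for illustration.

```python
x = 5

def f():
    x = 10          # assignment creates a new local variable named x
    return x

print(f())          # 10
print(x)            # 5: the outer x is unchanged

def g():
    new_variable = 10   # exists only while g is running

g()
try:
    new_variable
    exists_outside = True
except NameError as err:
    exists_outside = False
    print("NameError:", err)
```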

In [46]:
def bat():
    print(s)
    
s = "hello world"
bat()

hello world


In [47]:
def bat(s):
    print(s)
    
s = "hello world"    
bat("another string")

another string


What happened? 

- In the first case, `s` was not defined inside the function, so it was borrowed from the scope outside the function.
- In the second case, `s` was passed in directly, so it was used.
- This is very worrying, because of the following:

In [48]:
def modify_the_stuff():
    the_stuff[0] = 99999
    
the_stuff = [1,2,3]
modify_the_stuff()
the_stuff

[99999, 2, 3]

- Above: `modify_the_stuff` modified a variable that was not even passed in as an argument!
- So functions can really mess with your stuff without you knowing. 
- Please do not write code like this!
  - Safest: functions with no side effects.
  - Acceptable: functions with side effects, clearly documented.
  - Disaster: functions with undocumented side effects on their arguments.
  - Complete disaster: functions modifying stuff that you didn't even pass into the function.
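
As a contrast, here is a sketch of the "safest" option: the data is passed in as an argument and a new list is returned, so nothing outside the function changes. The function name `with_first_replaced` is made up for this example.

```python
def with_first_replaced(stuff, value):
    """Return a new list with the first element replaced; the input is not modified."""
    new_stuff = list(stuff)   # shallow copy of the input list
    new_stuff[0] = value
    return new_stuff

the_stuff = [1, 2, 3]
changed = with_first_replaced(the_stuff, 99999)

print(the_stuff)  # [1, 2, 3]
print(changed)    # [99999, 2, 3]
```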

Some other things to avoid:

In [None]:
def func(s, len):
    print(len(s))
    
func("hello", 5)

- Above: don't do this - inside the function there's a variable called `len` which shadows the built-in `len` function, so `len(s)` tries to call the integer `5` and fails.
- Below: functions can access other functions if they are all in the global scope:

In [None]:
def f():
    print("Hello from f!")
    
def g():
    f()
    
g()

That is, there's no need to pass the function `f` into `g` to call it, because `f` is "global".

## Python File Input/Output

Open a file for reading: ```infile = open("input.txt", "r") ```

Open a file for writing: ```outfile = open("output.txt", "w")```

Open a file for read/write: ```myfile = open("data.txt", "r+")```
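
A file opened this way stays open until `close()` is called. A common alternative is the `with` statement, which closes the file automatically when the block ends. The sketch below writes and reads a throwaway file in a temporary directory; the file name `numbers.txt` is an assumption made for the example.

```python
import os
import tempfile

# Hypothetical scratch file so the example does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "numbers.txt")

with open(path, "w") as outfile:
    for n in range(1, 4):
        outfile.write(str(n) + "\n")

print(outfile.closed)  # True: closed automatically at the end of the with block

with open(path, "r") as infile:
    lines = [line.strip("\n") for line in infile]

print(lines)  # ['1', '2', '3']
```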


In [16]:
infile = open("data.txt", "r")

for line in infile:
    print(line.strip('\n'))  # strip the trailing newline before printing

infile.close()

She should have died hereafter.
There would have been a time for such a word.
Tomorrow, and tomorrow, and tomorrow,
Creeps in this petty pace from day to day
To the last syllable of recorded time,
And all our yesterdays have lighted fools
The way to dusty death. Out, out, brief candle!
Life's but a walking shadow, a poor player
That struts and frets his hour upon the stage
And then is heard no more. It is a tale
Told by an idiot, full of sound and fury,
Signifying nothing.


In [17]:
# To read the whole file as one string, use .read(). This reads everything, including end-of-line characters.
infile = open("data.txt", "r")
text_string = infile.read()
print(text_string)

She should have died hereafter.
There would have been a time for such a word.
Tomorrow, and tomorrow, and tomorrow,
Creeps in this petty pace from day to day
To the last syllable of recorded time,
And all our yesterdays have lighted fools
The way to dusty death. Out, out, brief candle!
Life's but a walking shadow, a poor player
That struts and frets his hour upon the stage
And then is heard no more. It is a tale
Told by an idiot, full of sound and fury,
Signifying nothing.


In [18]:
infile = open("data.txt", "r")
print(infile.closed)	# False


False


In [19]:
lines = infile.readlines() #Read all lines in the file into a list
lines

['She should have died hereafter.\n',
 'There would have been a time for such a word.\n',
 'Tomorrow, and tomorrow, and tomorrow,\n',
 'Creeps in this petty pace from day to day\n',
 'To the last syllable of recorded time,\n',
 'And all our yesterdays have lighted fools\n',
 'The way to dusty death. Out, out, brief candle!\n',
 "Life's but a walking shadow, a poor player\n",
 'That struts and frets his hour upon the stage\n',
 'And then is heard no more. It is a tale\n',
 'Told by an idiot, full of sound and fury,\n',
 'Signifying nothing.']

In [20]:
infile.close()
print(infile.closed)	# True

True


In [None]:
#Writing to a Text File
outfile = open("output.txt", "w")

for n in range(1,11):
    outfile.write(str(n) + "\n")

outfile.close()


#### Processing a CSV File
- Using `split`
- Using the `csv` module
- Using `pandas`

In [55]:
with open("state_property_vote.csv", "r") as infile:
    for line in infile:
        # Split each line on commas and strip surrounding whitespace from every field
        fields = [field.strip() for field in line.strip(" \n").split(",")]
        print(fields)

['state', 'pop', 'med_prop_val', 'med_income', 'avg_commute']
['Montana', '1042520', '217200', '46608', '16.35']
['Alabama', '4863300', '136200', '42917', '23.78']
['Arizona', '6931071', '205900', '50036', '23.69']
['Arkansas', '2988248', '123300', '41335', '20.49']
['California', '39250017', '477500', '61927', '27.67']
['Colorado', '5540545', '314200', '61324', '23.02']
['Connecticut', '3576452', '274600', '70007', '24.92']
['Delaware', '952065', '243400', '59853', '24.97']
['District of Columbia', '681170', '576100', '75506', '28.96']
['Florida', '20612439', '197700', '47439', '25.8']
['Georgia', '10310371', '166800', '49240', '26.91']
['Hawaii', '1428557', '592000', '69549', '26.03']
['Idaho', '1683140', '189400', '47572', '19.71']
['Illinois', '12801539', '186500', '57458', '27.49']
['Indiana', '6633053', '134800', '49384', '22.66']
['Iowa', '3134693', '142300', '53816', '18.11']
['Kansas', '2907289', '144900', '52392', '18.52']
['Kentucky', '4436974', '135600', '42914', '22.4']
...

In [62]:
import csv

with open("state_property_vote.csv", "r") as infile:
    csvfile = csv.reader(infile)
    for row in csvfile:
        print(row[0])  # first column: the state name


state
Montana
Alabama
Arizona
Arkansas
California
Colorado
Connecticut
Delaware
District of Columbia
Florida
Georgia
Hawaii
Idaho
Illinois
Indiana
Iowa
Kansas
Kentucky
Louisiana
Maine
Maryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Alaska
Nebraska
Nevada
New Hampshire
New Jersey
New Mexico
New York
North Carolina
North Dakota
Ohio
Oklahoma
Oregon
Pennsylvania
Rhode Island
South Carolina
South Dakota
Tennessee
Texas
Utah
Vermont
Virginia
Washington
West Virginia
Wisconsin
Wyoming
Puerto Rico
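
`csv.reader` returns every field as a string. When the file has a header row, `csv.DictReader` lets you refer to columns by name, and numeric fields can be converted explicitly. The sketch below reads two rows copied from the output above out of an in-memory string, so it does not need the CSV file itself:

```python
import csv
import io

# Two sample rows taken from the output above, held in memory instead of a file.
sample = io.StringIO(
    "state,pop,med_prop_val,med_income,avg_commute\n"
    "Montana,1042520,217200,46608,16.35\n"
    "Alabama,4863300,136200,42917,23.78\n"
)

rows = []
for row in csv.DictReader(sample):
    rows.append({
        "state": row["state"],
        "pop": int(row["pop"]),                    # convert text to an integer
        "avg_commute": float(row["avg_commute"]),  # convert text to a float
    })

print(rows[0])
```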


#### Try it: Python Files

Write a Python program that writes to the file test.txt the numbers from 20 to 10 in descending order. Then, write another program that reads your newly created test.txt file line by line and only prints out the value if it is even.


In [None]:
## YOUR TURN TO PRACTICE


In [64]:
import pandas as pd

## What is pandas?
- pandas is essentially your data's home: it is the tool we use to explore our datasets.
- We use pandas for cleaning, transforming, and analyzing data.
- pandas reads the data from your dataset (e.g. a CSV file) into a DataFrame.
- Then, you can:
    - Calculate statistics and answer questions about the data (average, median, max, min)
    - Clean the data by removing missing values
    - Visualize the data
    - Store the cleaned data


#### How does pandas fit into the data science toolkit?
- The pandas library is normally used in combination with other data science libraries.
- pandas is built on top of the NumPy package, so a lot of the structure of NumPy is used or replicated in pandas.
- The core components of pandas are the ```Series``` and the ```DataFrame```.
- A ```Series``` is essentially a column, and a ```DataFrame``` is a two-dimensional table made up of a collection of Series.
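
A minimal sketch of the relationship between the two, assuming pandas is installed. The small table is built by hand from three rows of the dataset used in this lecture, and the variable names `df` and `pop` are chosen for illustration.

```python
import pandas as pd

# A tiny DataFrame built from a dictionary of columns.
df = pd.DataFrame({
    "state": ["Montana", "Alabama", "Arizona"],
    "pop": [1042520, 4863300, 6931071],
})

pop = df["pop"]  # selecting a single column gives a Series

print(type(df).__name__)   # DataFrame
print(type(pop).__name__)  # Series
print(df.shape)            # (3, 2): three rows, two columns
print(pop.max())           # 6931071
```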

In [70]:
# Create a dataframe from a CSV file

dataframe = pd.read_csv("state_property_vote.csv")
print(dataframe)

                   state       pop  med_prop_val  med_income  avg_commute
0                Montana   1042520        217200       46608        16.35
1                Alabama   4863300        136200       42917        23.78
2                Arizona   6931071        205900       50036        23.69
3               Arkansas   2988248        123300       41335        20.49
4             California  39250017        477500       61927        27.67
5               Colorado   5540545        314200       61324        23.02
6            Connecticut   3576452        274600       70007        24.92
7               Delaware    952065        243400       59853        24.97
8   District of Columbia    681170        576100       75506        28.96
9                Florida  20612439        197700       47439        25.80
10               Georgia  10310371        166800       49240        26.91
11                Hawaii   1428557        592000       69549        26.03
12                 Idaho   1683140    

In [66]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   state         52 non-null     object 
 1   pop           52 non-null     int64  
 2   med_prop_val  52 non-null     int64  
 3   med_income    52 non-null     int64  
 4   avg_commute   52 non-null     float64
dtypes: float64(1), int64(3), object(1)
memory usage: 2.2+ KB


In [72]:
# Viewing the data with .head()
dataframe.head(3)

     state      pop  med_prop_val  med_income  avg_commute
0  Montana  1042520        217200       46608        16.35
1  Alabama  4863300        136200       42917        23.78
2  Arizona  6931071        205900       50036        23.69


In [68]:
# Viewing the last rows of our data with .tail()
dataframe.tail(3)

          state      pop  med_prop_val  med_income  avg_commute
49    Wisconsin  5778709        173200       52632        20.89
50      Wyoming   585501        209500       58291        15.94
51  Puerto Rico  3411307        111900       20078        28.36


In [73]:
#By default, the dropna() method returns a new DataFrame, and will not change the original.

new_df = dataframe.dropna()
new_df

                  state       pop  med_prop_val  med_income  avg_commute
0               Montana   1042520        217200       46608        16.35
1               Alabama   4863300        136200       42917        23.78
2               Arizona   6931071        205900       50036        23.69
3              Arkansas   2988248        123300       41335        20.49
4            California  39250017        477500       61927        27.67
5              Colorado   5540545        314200       61324        23.02
6           Connecticut   3576452        274600       70007        24.92
7              Delaware    952065        243400       59853        24.97
8  District of Columbia    681170        576100       75506        28.96
9               Florida  20612439        197700       47439        25.80


In [None]:
# Remove all rows with NULL values, modifying new_df in place:
new_df.dropna(inplace=True)
new_df

# Replace NULL values with the number 1:
#new_df.fillna(1, inplace=True)
#new_df["med_income"] = new_df["med_income"].fillna(1)


# Replace NULL values in one column with the column mean:
#x = new_df["med_income"].mean()
#new_df["med_income"] = new_df["med_income"].fillna(x)
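
The state dataset has no missing values, so these calls leave it unchanged. To see them do something, here is a sketch on a tiny made-up table with one missing income. All values and the names `df`, `dropped`, and `filled` are invented for illustration; pandas is assumed to be installed.

```python
import pandas as pd

# Made-up data with one missing value (None becomes NaN).
df = pd.DataFrame({
    "state": ["A", "B", "C"],
    "med_income": [40000.0, None, 50000.0],
})

dropped = df.dropna()                    # new DataFrame without the incomplete row
mean_income = df["med_income"].mean()    # missing values are skipped by default

filled = df.copy()
filled["med_income"] = filled["med_income"].fillna(mean_income)

print(len(df), len(dropped))             # 3 2
print(filled["med_income"].tolist())     # [40000.0, 45000.0, 50000.0]
```

Assigning the result of `fillna` back to the column, rather than using `inplace=True` on a selected column, keeps it explicit which object is being changed.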

In [None]:
from datetime import datetime

# Get current date
datetime.now()

In [None]:
from datetime import date

# Creating a datetime.date type
date(2022,9,21)


In [None]:
# Convert a column to datetime:
# df['Date'] = pd.to_datetime(df['Date'])
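
`pd.to_datetime` converts a whole column at once. For a single string, the standard library's `datetime.strptime` does the parsing, given a format that describes the text. The date string and format below are assumptions chosen to match the `date(2022,9,21)` example above.

```python
from datetime import date, datetime

# Parse text into a datetime using an explicit format.
parsed = datetime.strptime("2022-09-21", "%Y-%m-%d")
print(parsed.date() == date(2022, 9, 21))   # True

# Format a datetime back into text.
formatted = parsed.strftime("%Y/%m/%d")
print(formatted)                            # 2022/09/21

# Subtracting two dates gives a timedelta.
delta = date(2022, 9, 21) - date(2022, 9, 1)
print(delta.days)                           # 20
```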
