Python lessons during the DataCamp course.
Matplotlib - is a packet for data visualization. We will be using its sub-packet named pyplot.
import matplotlib.pyplot as plt
To plot some information, we use the plt.plot, as followed:
year = [1950, 1970, 1991, 2010]
pop = [2.519, 3.692, 5.263, 6.972]
plt.plot(year, pop) # Line plot, First argument = x-axis, second argument = y-axis
Python will only plot the graphic when the show method is called.
plt.show() # Displays the plot
O gráfico de dispersĂŁo Ă© uma ferramenta gráfica amplamente utilizada em estatĂstica e outras áreas do conhecimento para visualizar a relação entre duas variáveis quantitativas. Ao plotar os pontos no gráfico, Ă© possĂvel detectar padrões e tendĂŞncias nos dados e identificar possĂveis correlações entre as variáveis. Fonte FM2S.
plt.scatter(year, pop) # It will plot the dots without the line connecting it
plt.show()
Tip: Python For Data Science Cheat Sheet.
To change the x-axis to a logarithmic scale, just plt.xscale('log')
.
Next code gives colors to the dots while using a graph such as scatter:
dict = {
'Asia':'red',
'Europe':'green',
'Africa':'blue',
'Americas':'yellow',
'Oceania':'black'
}
# Specify c and alpha inside plt.scatter()
plt.scatter(x = gdp_cap, y = life_exp, s = np.array(pop) * 2, c=col)
# I could also add the parameter alpha=0.8 to change opacity of the bubbles
I can also add other customizations such as:
# Additional customizations
plt.text(1550, 71, 'India') # x-axis location, y-axis location, text
plt.text(5700, 80, 'China')
# Add grid() call
plt.grid(true)
Helps to get idea about distribution. X-axis are for the bins and it defines the number of equal-width bins in the range. Bins has a default value of 10.
To get help: help(plt.hist)
.
import matplotlib.pyplot as plt
values = [0, 0.6, 1.4, 1.6, 2.2, 2.5, 2.6, 3.2, 3.5, 3.9, 4.2, 6]
plt.hist(value, bins=3) # Bins will be 0, 3, 6.
plt.show()
- To show the bins on Y-axis use
?
. plt.clf()
Clean the current figure.
plt.plot(x_axis, y_axis)
plt.xlabel(´X-axis Label`)
plt.ylabel(´Y-axis Label`)
plt.title(´Title of the Plot`)
plt.yticks([0, 2, 4, 6, 8, 10]) # Gives the values that will appear on Y-axis
plt.yticks([0, 2, 4, 6, 8, 10],
[´0´, ´2B´, ´4B´, ´6B´, ´8B´, ´10B´]) # This way, it will give the names fot the Y-axis
plt.show()
dict_name = { "key_1":1, "key_2",2 } # Key value pairs
dict_name["key_1"]
print(dict_name.keys()) # To print out all the key values
dict_name["key_3"] = 3 # To add more data to a dictionary
"key_3" in dict_name # To check if a key exists inside the list
dict_name["key_3"] = 4 # To change a value
del(dict_name["key_3"]) # To delete a key value pair
Dictionaries can contain other disctionaries inside (values are again dictionaries).
# Dictionary of dictionaries
europe = { 'spain': { 'capital':'madrid', 'population':46.77 },
'france': { 'capital':'paris', 'population':66.03 },
'germany': { 'capital':'berlin', 'population':80.62 },
'norway': { 'capital':'oslo', 'population':5.084 } }
print(europe['spain']['population']) # to show the population of spain
To add more data (as sub-dictionaries) on the europe dictionary, just:
# Create sub-dictionary data
data = {'capital':'rome','population':59.83}
# Add data to europe under key 'italy'
europe['italy'] = data
Pandas is an open source library, providing high-performance, easy-to-use data structures and data analysis tools for Python. To use Pandas, we have to import its package first.
import pandas as pd
Pandas has advantages over Numpy, once Pandas handles better different data types in a single variable. It creates a DataFrame.
We can create a DataFrame from a Dictionary.
dict_name = {
"Column1":["Value10", "Value01", "Value02", "Value03"],
"Column2":["Value20", "Value21", "Value22", "Value23"],
"Column3":[30, 31, 32, 33] }
DataFrame_name = pd.DataFrame(dict_name)
Pandas automatict assigns row names (0 and forward), we can change that by adding index name to it:
DataFrame_name.index = ["NameRow1", "NameRow2", "NameRow3", "NameRow4", "NameRow5"]
We can read csv files using that!
DataFrame_name = pd.read_csv("path/to/file_name.csv")
Now, when we have index names on our csv file, we need to pass the index_col
argument. Let's do it:
DataFrame_name = pd.read_csv("path/to/file_name.csv", index_col=0)
Now let's index and select data.
To print the whole column:
print(DataFrame_name["Column2"])
But this, will return a dtype name object. If I type type(DataFrame_name["Column2"]
it will retur as pandas.core.series.Series
.
To return the data and keep the data in a DataFrame format, we need to use double square brackets.
print(DataFrame_name[["Column2"]])
Then when we check its type by type(DataFrame_name["Column2"]
it will retur pandas.core.series.DataFrame
.
I can then use more columns by:
print(DataFrame_name[["Column2", "Column3"]])
I can select the rows by using the index.
print(DataFrame_name[1:3]) # index starts in 0, so we will print rows 2 to 4
In pandas we have:
loc
as label-based.iloc
as integer position-based.
To use get a row using index with names:
DataFrame_name.loc[["NameRow5"]] # Two brackets to retur DataFrame format
DataFrame_name.iloc[[4]] # Same result but using iloc
Now to get only a few rows and columns:
DataFrame_name.loc[["NameRow4", "NameRow5"], ["Column1", "Column2"]]
DataFrame_name.iloc[[3, 4], [0, 1]] # Same result but using iloc
To get all rows I could simply use:
DataFrame_name.loc[:, ["Column1", "Column2"]]
DataFrame_name.iloc[:, [0, 1] # Same result but using iloc
Same as C, ==, >=, <=, !=.
DataFrame_name["area"] # Get pandas series to analize
is_huge = DataFrame_name["area"] > 8 # Compare each value from the serie to >8
# To show info about itens that are greate than 8, just
print(DataFrame_name[is_huge]
# Or, I can simply:
print(DataFrame_name[DataFrame_name["area"] > 8]])
Comparators works same way as in C.
x > 5 and x < 15
for AND operator.logical_and(x > 5, x < 15)
NumPy equivalent for AND.y > 1 or y < -5
for OR operator.logical_or(y > 1, y < -5)
NumPy equivalent for OR.not False
for NOT operator.logical_not(False)
NumPy equivalent for NOT.
Note: Once Pandas is created based on NumPy packet, it is possible to use the boolean operators also.
np.logical_and(DataFrame_name["area"] > 8], DataFrame_name["area"] < 100])
DataFrame_name[np.logical_and(DataFrame_name["area"] > 8], DataFrame_name["area"] < 100])]
For strings, it gets according to alphabel: 'rafaela' < 'amanda' ? The answer is False.
if condition :
expression
# to continue the if, just use 4 spaces or tab
z = 4
if z % 2 == 0 : # True
print("z is even")
if condition :
expression
else :
expression
z = 5
if z % 2 == 0 : # True
print("z is even")
else :
print("z is odd")
if condition :
expression
elif condition :
expression
else :
expression
while condition :
espression
error = 5
while error > 1 :
error = error / 4
print(error)
for var in seq :
expression
for index_name in list_name :
print(index_name) # Not list_name[index_name], prints content of index
for index, index_name in enumerate(list_name) :
print ("index " + str(index) + ": " + str(index_name)) # index 0: value
# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]
# Change for loop to use enumerate() and update print()
for a, area in enumerate(areas) :
print("room " + str(a) + ": " + str(area))
dict = { "key1":1,
"key2":2,
"key3":3 }
for key, value in dict.items() : # it could be k, v
print(key + " - " + str(value)) # it doen't follow any order
value = np.array([array1, array2])
for val in value :
print(val)
for val in np.nditer(value)
print(Val) # each value in each line, all the array1 and then array2
DataFrame_name = pd.read_csv("csv_name.csv", index_col = 0)
for label, row in DataFrame_name.iterrows():
print(label)
print(row)
for label, row in DataFrame_name.iterrows():
print(label + ": " + row["column_name"])
for label, row in DataFrame_name.iterrows():
DataFrame_name.loc[label, "Lenght"] = len(row["column_name"])
DataFrame_name.loc[lab, "COLUMN_NAME"] = row["column_name"].upper()
DataFrame_name["Lenght"] = DataFrame_name["column_name"].apply(len) # returns the same as before
DataFrame_name["COLUMN_NAME"] = DataFrame_name["column_name"].apply(str.upper)
The for loop does not require an indexing variable to set beforehand.
fruits = ["apple", "banana", "cherry"]
for x in fruits:
print(x)
if x == "banana":
break
Looping Through a String:
for x in "banana":
print(x)
Do not print banana:
fruits = ["apple", "banana", "cherry"]
for x in fruits:
if x == "banana":
continue
print(x)
The range() function returns a sequence of numbers, starting from 0 by default, and increments by 1 (by default), and ends at a specified number.
for x in range(6): # Goes from 0 to 6, with 1 of increment
print(x)
for x in range(2, 6): # Goes from 2 to 6, with 1 of increment
print(x)
for x in range(2, 30, 3): # Goes from 2 to 30, with 3 of increment
print(x)
fot x in range(4):
pass # for loops cannot be empty, the pass statement to avoid getting an error
NOTE that for using the random()
method several times I had to use [ expression for _ in range(100) ]
.
random = [random.random() for _ in range(100)] # Returns a random float number between 0 and 1.
Now to use it, and make it run over all the elements in the array:
# Calculate logarithm for each element in the array
log_R1 = [math.log(x) for x in R1] # R1 is a random variable with uniform distribution
log_R1 returns:
There are a lot of methods to use within the random packet. Here are listed 3 that called my attention:
randrange()
# Returns a random number between the given range.randint()
# Returns a random number between the given range -1.random()
# Returns a random float number between 0 and 1.
import random
values = np.random.randint(-10,10,1000) # Lower limit, Upper limit, number of values : <class 'numpy.ndarray'>
print('Mean: ' + str(np.float64(np.mean(values)))) # It returns a <numpy.float64> type, then is converted to string
print('Standard deviation: ' + str(np.float64(np.std(values)))) # Same as before
## or to generate 100 random numbers - random() doesn't has param: number of values
random = [random.random() for _ in range(100)] # Returns a random float number between 0 and 1.
More about random
np.random.rand() # Pseudo-random numbers between 0 and 1
np.random.seed(123) # Gives a seed for the random generator
outcomes = [] # Initialize an empty list
for x in range(10) :
value = np.random.randint(0,10)
outcomes.append(value)
tails = [0]
for x in range(10) :
coin = np.random.randint(0,2)
tails.append(tails[x] + coin)
# Replace below: use max to make sure step can't go below 0
step = max(0, step -1)