<a href="https://colab.research.google.com/github/ColeBromfield01/bromfield-portfolio/blob/main/Homework_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem 1

Just as $P(H | E_{1}) = \frac{P(E_{1} | H)P(H)}{P(E_{1})}$, our more informed posterior can take the form:

$P(H | E_{1}, E_{2}) = \frac{P(E_{2} | H, E_{1})P(H | E_{1})}{P(E_{2} | E_{1})}$

Since $E_{1}$ and $E_{2}$ are assumed to be independent, both conditionally on $H$ and marginally,

$P(E_{2}|H, E_{1}) = P(E_{2}|H)$

and

$P(E_{2}|E_{1}) = P(E_{2})$

Therefore, our posterior can be rewritten as:

$P(H | E_{1}, E_{2}) = \frac{P(E_{2} | H)P(H | E_{1})}{P(E_{2})}$

And thus, our X and Y values are:

$X = P(E_{2}|H)$

$Y = P(E_{2})$

*Time spent: ~20 minutes*

# Problem 2
## Part A
### [1,1]
With these parameters, we get a uniform distribution.  In a practical sense, the expected probability of heads is 0.5, but our confidence in that value is low, because every probability between 0 and 1 is equally likely.

### [2,2]
Once again, we get a symmetrical distribution, but no longer are all probabilities equally likely.  Now, more balanced values (closer to 50-50) have a higher likelihood, while fringe values (very high or very low probability of success) are less likely.

### [10,10]
The distribution is again centered at 0.5, but the strength of belief increases further.  Now, strongly imbalanced probabilities (roughly outside of the 0.3-0.7 range) are unlikely.

### [2,10]
We now see a right-skewed distribution with a low expected probability (2/(2+10) ≈ 0.17).  The density for probabilities above 0.5 approaches zero, which suggests a high strength of belief that the probability of heads is low.

### [10,2]
This is simply the mirror image of the [2,10] distribution: the expected probability is high (10/(10+2) ≈ 0.83), and probabilities below 0.5 are highly unlikely.

## Part B
I already knew the formula: the expected value is alpha (the first parameter) divided by the sum of alpha and beta (the second parameter).  For example, for Beta(10, 2), the expected probability is 10/(10+2) = 5/6.  Any pair of values with the same ratio will therefore produce the same expected probability, differing only in strength of belief (a larger sum of alpha and beta gives a smaller variance).
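As a quick check of the formula above, here is a minimal sketch that computes the mean and variance of each Beta distribution from Part A using exact fractions. The helper names `beta_mean` and `beta_variance` are my own, not from the assignment; the variance expression is the standard closed form for a Beta distribution.

```python
from fractions import Fraction


def beta_mean(alpha, beta):
    """Expected value of Beta(alpha, beta): alpha / (alpha + beta)."""
    return Fraction(alpha, alpha + beta)


def beta_variance(alpha, beta):
    """Variance of Beta(alpha, beta): ab / ((a + b)^2 (a + b + 1))."""
    total = alpha + beta
    return Fraction(alpha * beta, total**2 * (total + 1))


# Parameter pairs from Part A
for a, b in [(1, 1), (2, 2), (10, 10), (2, 10), (10, 2)]:
    print(f"Beta({a}, {b}): mean = {beta_mean(a, b)}, variance = {beta_variance(a, b)}")
```

The three symmetric pairs share the same mean, while their variance shrinks as the parameters grow, which matches the "same ratio, different strength of belief" observation.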

*Time spent: ~15 minutes*

# Problem 3
## Part A
If we use weights of 3 and a bias term of -4, the output is correct (1 only when both inputs are 1) once it is rounded to the nearest label:
* $y = \sigma(3x_{1} + 3x_{2} - 4)$

If $x_{1}$ and $x_{2}$ are both 0:

$y = \sigma(3 \cdot 0 + 3 \cdot 0 - 4) = \sigma(-4) = \frac{1}{1 + e^{4}} = 0.0180 \approx 0$

If $x_{1}$ or $x_{2}$ (but not both) is equal to 1:

$y = \sigma(3 \cdot 1 + 3 \cdot 0 - 4) = \sigma(-1) = \frac{1}{1 + e} = 0.2689 \approx 0$

If $x_{1}$ and $x_{2}$ are both 1:

$y = \sigma(3 \cdot 1 + 3 \cdot 1 - 4) = \sigma(2) = \frac{1}{1 + e^{-2}} = 0.8808 \approx 1$
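The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the function names `sigmoid` and `neuron` are my own, and rounding the output at 0.5 to get a label is an assumption about how the output is read.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x1, x2, w=3.0, b=-4.0):
    """Single neuron with equal weights w on both inputs and bias b."""
    return sigmoid(w * x1 + w * x2 + b)


# Evaluate all four binary input combinations
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = neuron(x1, x2)
    print(f"x1={x1}, x2={x2}: y = {y:.4f}, rounded label = {round(y)}")
```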

## Part B
The problem here is a lack of linear separability.  With XOR, (0, 0) and (1, 1) are classified under label 0, while (0, 1) and (1, 0) are classified under label 1, so no single straight line can separate the two classes.  A single-layer neural network can only produce a linear decision boundary, so it cannot represent XOR.

## Part C
With a two-neuron hidden layer, each neuron can create a linear boundary--combined, these boundaries allow the network to produce the correct XOR output.
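To make this concrete, here is a sketch of one possible two-neuron hidden layer that produces XOR. The weights are hand-picked for illustration (they are not from the assignment and were not learned): one hidden neuron approximates OR, the other approximates AND, and the output fires when OR is on but AND is off.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def xor_network(x1, x2):
    """Two hidden neurons plus one output neuron, with hand-picked weights."""
    # Hidden neuron 1 acts like OR: its boundary is x1 + x2 = 0.5
    h_or = sigmoid(20 * x1 + 20 * x2 - 10)
    # Hidden neuron 2 acts like AND: its boundary is x1 + x2 = 1.5
    h_and = sigmoid(20 * x1 + 20 * x2 - 30)
    # Output is high only when OR is on and AND is off
    return sigmoid(20 * h_or - 20 * h_and - 10)


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(f"x1={x1}, x2={x2}: label = {round(xor_network(x1, x2))}")
```

The large weights simply push each sigmoid close to 0 or 1; a trained network would likely settle on different values that carve out the same two boundaries.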

## Part D
The sigmoid function is a more effective activation function because it is continuous and differentiable everywhere.  The step function is discontinuous (and therefore non-differentiable) at its threshold and has a gradient of zero everywhere else, so it would prove problematic when training deep neural networks with gradient descent: the weight updates would receive no useful gradient signal.

*Time spent: ~20 minutes*

# Problem 4
## Part A
Using Euclidean distance in a nearest-neighbor classifier with a 10,000-dimensional feature vector presents a high risk of encountering the curse of dimensionality.  As the number of dimensions increases, the distances between points become increasingly uniform, so the nearest neighbor is barely closer than the farthest one and classifying points by their nearest neighbors becomes ambiguous.
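The effect can be illustrated with a small simulation. This is a sketch under simple assumptions that are mine, not the assignment's: points are drawn uniformly from the unit hypercube, and the helper `relative_contrast` measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(dim, n_points=100, seed=0):
    """Return (max - min) / min over distances from one random query point
    to n_points random points in the dim-dimensional unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


# The contrast shrinks as the dimension grows
for dim in (2, 100, 10_000):
    print(f"dim={dim}: relative contrast = {relative_contrast(dim):.3f}")
```

A contrast near zero means all points are at almost the same distance from the query, which is exactly the situation where a nearest-neighbor vote carries little information.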

## Part B
Principal component analysis (PCA) transforms the data such that dimensions are ordered by explained variance.  Removing lower-variance dimensions allows the user to mitigate the curse of dimensionality.  PCA is particularly useful in cases where a significant portion of the variance is explained by a small number of dimensions--for example, if the number of dimensions in the data can be reduced from 10,000 dimensions to 10 while retaining 95% of the variance, the accuracy of the kNN classifier would likely increase significantly.

*Time spent: ~5 minutes*

# Problem 5

```python
import re
from collections import Counter

from bs4 import BeautifulSoup

# Reading the HTML source from file
with open('Assignment0-converted.html', 'r', encoding='utf-8') as myfile:
    html_content = myfile.read()

# Parsing text from HTML code
soup = BeautifulSoup(html_content, 'html.parser')
text = soup.get_text()

# Replacing non-alphabetical characters with whitespace
pattern = '[^a-zA-Z]'
cleaned_text = re.sub(pattern, ' ', text)

# Converting to lowercase
cleaned_text = cleaned_text.lower()

# Splitting into a list of words
words = cleaned_text.split()

# Getting the 20 most common words in the list
word_counts = Counter(words)
most_common_words = word_counts.most_common(20)

for word, count in most_common_words:
    print(f"{count} {word}")
```

```
65 the
29 a
28 and
27 in
23 to
22 you
22 of
19 this
18 is
17 x
17 that
16 e
15 question
14 on
14 for
12 as
11 distribution
10 it
10 t
9 please
```
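The single-letter tokens in this output (`x`, `e`, `t`) are probably a side effect of replacing every non-alphabetical character with whitespace: a contraction such as "don't" would split into `don` and `t`, and a symbol such as `x1` would leave a bare `x`. As one possible refinement, the sketch below keeps straight apostrophes inside words. The `tokenize` helper and the sample sentence are hypothetical and not part of the assignment; curly apostrophes are not handled.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract words, keeping internal apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


# Hypothetical sample text, used only to illustrate the tokenizer
sample = "Don't split this: it's one word. Don't!"
counts = Counter(tokenize(sample))

for word, count in counts.most_common(3):
    print(f"{count} {word}")
```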


*Time spent: ~25 minutes*