# Understanding Students' Career Aspirations 

A survey was conducted to understand the career aspirations of students. It gathered responses from more than 1400 students, but many of these responses are misspelt or inconsistently capitalised. We do know, however, that every response is one of the following 10 career choices.
1. Doctor
2. Lawyer
3. Teacher
4. Engineer
5. Accountant
6. Nurse
7. Police
8. Architect
9. Dentist
10. Pharmacist

We shall use a program written in Python to correct these responses and analyse the data.

Let us first import the Excel file that contains these responses using the pandas library.

In [1]:
import pandas as pd

# header=None: treat the first row as data, so the responses land in column 0
df = pd.read_excel(r"C:\Users\A.MANIDEEP REDDY\Desktop\TaskData1.xlsx", header=None)

Let us now check the dataframe by printing it.

In [2]:
print(df)

               0
0         cokter
1      dentiists
2        Enginir
3         PoLICE
4      engneiear
...          ...
1401     enginir
1402  aCcountANT
1403  accountant
1404      POlIcE
1405    enjineer

[1406 rows x 1 columns]


We have 1406 responses, many of which are misspelt. We shall now fix them using the Levenshtein algorithm.

### Levenshtein Algorithm

The Levenshtein algorithm calculates the minimum number of single-letter edits (called the Levenshtein distance) needed to turn a given word into the required word. Each edit is one of the following:

1. Deletion of a letter
2. Insertion of a letter
3. Replacement of a letter



A simple recursive version of this algorithm is:

In [3]:
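def levenshtein(a, b):
    # If one string is empty, the distance is the length of the other
    if not a:
        return len(b)
    if not b:
        return len(a)

    # Matching first letters cost nothing; compare the remaining letters
    if a[0] == b[0]:
        return levenshtein(a[1:], b[1:])

    # Otherwise spend one edit and take the cheapest of the three options
    return 1 + min(
        levenshtein(a[1:], b),      # delete the first letter of a
        levenshtein(a, b[1:]),      # insert the first letter of b
        levenshtein(a[1:], b[1:])   # replace the first letter of a
    )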

This function calculates the Levenshtein distance between two strings, a and b. Let us check it with an example.

In [4]:
a='cokter'
b='doctor'
levenshtein(a,b)

3

The function says that the word cokter requires 3 edits to correct its spelling to doctor, which is true:

cokter --> dokter --> docter --> doctor
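
The recursive function above is easy to read, but it recomputes the distance for the same pairs of suffixes many times, so it slows down quickly as the words get longer. A common alternative fills in a table of distances between prefixes, one row at a time. The sketch below is one such variant; the name `levenshtein_dp` is hypothetical and not part of the notebook.

```python
def levenshtein_dp(a, b):
    """Levenshtein distance computed row by row over a table of prefix distances."""
    # previous[j] is the distance between the prefix of a handled so far
    # (initially the empty string) and the first j letters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from the first i letters of a to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # replace char_a (free when the letters match)
            ))
        previous = current
    return previous[-1]


print(levenshtein_dp('cokter', 'doctor'))
```

The work done here grows with the product of the two word lengths, and the function gives the same distance of 3 for cokter and doctor as the recursive version.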

### Approach and logic behind the code

Now, we know that all 1406 responses are misspelt forms of just 10 words. So we shall find the Levenshtein distance of each response from each of these 10 words; the word with the smallest distance is taken as the correct word. We start by creating a series of the correctly spelt words.

In [5]:
# Series containing the correct spellings. This will be used as the reference for the Levenshtein function.
correctspellings = pd.Series(['Doctor', 'Lawyer', 'Teacher', 'Engineer', 'Accountant',
                              'Nurse', 'Police', 'Architect', 'Dentist', 'Pharmacist'])

One thing to keep in mind is that the above function counts changing an upper-case letter to its lower-case version (and vice versa) as an edit, which we do not want. To avoid this, we convert both the response and the correctly spelt words to lower case before finding the Levenshtein distance.
We shall also create a series that holds the number of students who chose each career.

Now we iterate over the elements of df (the dataframe containing the responses), use the above logic to find the corrected version of each response from the 'correctspellings' series, and replace the response with that word.

In [None]:
# Series holding the count of each career choice. The index is the career choices and every count starts at 0.
values = pd.Series([0] * len(correctspellings), index=correctspellings)

for j in range(len(df)):  # Outer loop: iterate over the 1400+ responses
    best_distance = float('inf')  # Smallest distance found so far for this response
    response = df.iloc[j, 0]
    for candidate in correctspellings:  # Inner loop: iterate over the correct spellings
        distance = levenshtein(candidate.lower(), response.lower())
        if distance < best_distance:
            best_distance = distance
            corrected_spelling = candidate  # After the inner loop, this holds the closest correct spelling
    df.iloc[j, 0] = corrected_spelling    # Replace the response with its corrected version
    values.loc[corrected_spelling] += 1   # Update the count of that career choice
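
The nearest-word search in the loop above can also be written as a small helper built on `min`. The sketch below is pure Python and makes a few assumptions that are not part of the notebook: the names `CAREERS`, `edit_distance`, `closest_career` and `correct_responses` are hypothetical, the recursive distance function is wrapped in `functools.lru_cache` so that repeated sub-problems are looked up instead of recomputed, and each distinct response is corrected only once, with the result reused for repeats.

```python
from collections import Counter
from functools import lru_cache

# The ten correct spellings listed at the top of the notebook.
CAREERS = ('Doctor', 'Lawyer', 'Teacher', 'Engineer', 'Accountant',
           'Nurse', 'Police', 'Architect', 'Dentist', 'Pharmacist')


@lru_cache(maxsize=None)
def edit_distance(a, b):
    """Recursive Levenshtein distance; results are cached per (a, b) pair."""
    if not a:
        return len(b)
    if not b:
        return len(a)
    if a[0] == b[0]:
        return edit_distance(a[1:], b[1:])
    return 1 + min(
        edit_distance(a[1:], b),      # delete
        edit_distance(a, b[1:]),      # insert
        edit_distance(a[1:], b[1:]),  # replace
    )


def closest_career(response, careers=CAREERS):
    """Return the career nearest to the response, ignoring letter case."""
    lowered = response.lower()
    # min keeps the first career on ties, like the strict '<' test in the loop
    return min(careers, key=lambda career: edit_distance(career.lower(), lowered))


def correct_responses(responses):
    """Correct every response and count how often each career appears."""
    cache = {}       # maps each distinct raw response to its corrected spelling
    corrected = []
    for response in responses:
        if response not in cache:
            cache[response] = closest_career(response)
        corrected.append(cache[response])
    return corrected, Counter(corrected)


# A few responses taken from the dataframe printed earlier
sample = ['cokter', 'dentiists', 'Enginir', 'PoLICE', 'enjineer', 'PoLICE']
corrected, counts = correct_responses(sample)
print(corrected)
print(counts)
```

The `Counter` plays the same role as the `values` series, and the list of corrected words could be assigned back to the dataframe column if this approach were adopted.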

The correction loop takes around 10 minutes to iterate over all 1406 responses, largely because the recursive levenshtein function recomputes the same sub-problems many times. Let us now check the dataframe.

In [None]:
df = df.rename(columns={0: 'Career Choices'})  # Give the dataframe column a header
print(df)

The dataframe has now been edited, but the Excel file still contains the misspelt responses. We shall update the Excel file by overwriting it with the dataframe.

In [None]:
df.to_excel(r"C:\Users\A.MANIDEEP REDDY\Desktop\TaskData1.xlsx", index=False)

### Bar Graph Representation of the Data

The Excel file has been updated with the correct spellings. The series named values has also been updated and now contains the data required to plot a bar graph.

We shall now plot the bar graph.

In [None]:
import matplotlib.pyplot as plt

values.plot.bar()  # One bar per career choice
plt.xlabel("Career Choice")
plt.ylabel("No. of Students")
plt.show()

From the bar graph we observe that engineer is the most common career choice among the students and accountant is the least preferred. For a more precise analysis, we can print the series 'values'.

In [None]:
print(values)
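
Beyond the raw counts, it can help to rank the choices and express each one as a share of all responses. The sketch below shows one way to do that in plain Python. The function name `summarise` is hypothetical, and the counts in `example_counts` are made up purely for illustration; they are not the survey results. For the notebook's data, `values.to_dict()` would supply a mapping of the same shape.

```python
def summarise(counts):
    """Return (career, count, percentage) rows sorted from most to least common."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(career, count, round(100 * count / total, 1))
            for career, count in ranked]


# Made-up counts for illustration only; these are NOT the survey results.
example_counts = {'Doctor': 3, 'Engineer': 5, 'Accountant': 2}

for career, count, percent in summarise(example_counts):
    print(f"{career:<12}{count:>4}{percent:>7}%")
```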