# INTRODUCTION

 In this notebook we work with the dataset named **"Pakistan Intellectual Capital"**. It contains a list of computer science/IT professors from **89** different universities in **Pakistan**.

**Variables** in the dataset are Serial No, Teacher’s Name, University Currently Teaching, Department, Province University Located, Designation, Terminal Degree, Graduated from (university for professor), Country of graduation, Year, Area of Specialization/Research Interests, and some Other Information.

Let's begin...

# IMPORT REQUIRED LIBRARIES

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# IMPORT DATA

In [None]:
original_df = pd.read_csv('../input/pakistanintellectualcapitalcs/Pakistan Intellectual Capital - Computer Science - Ver 1.csv', encoding = "ISO-8859-1")
copy1_df = original_df.copy()
copy1_df.head()

In [None]:
# Renaming column names for convenience
copy1_df = copy1_df.rename(columns={
    "Teacher Name": "teacher_name",
    "University Currently Teaching": "current_university",
    "Department": "department",
    "Province University Located": "province",
    "Designation": "designation",
    "Terminal Degree": "degree",
    "Graduated from": "graduated_from",
    "Country": "country",
    "Year": "year",
    "Area of Specialization/Research Interests": "specialization",
    "Other Information": "other_information"
})
copy1_df.head()  

In [None]:
copy1_df.shape

In [None]:
# This provides information about the data
copy1_df.info()

From the output above, we can see that designation, degree, graduated_from, country, year, specialization, and other_information contain **NULL** values.

In [None]:
# Dropping columns we do not need: S# and other_information
copy1_df.drop(["S#", "other_information"], axis=1, inplace= True)

In [None]:
# this shows the number of null values present in the data
copy1_df.isna().sum()
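
Raw null counts are easier to judge as a share of all rows. The following is a minimal sketch on a made-up toy frame (`toy_df` and its values are assumptions for illustration, not the real dataset); the same expression could be applied to `copy1_df`.

```python
import pandas as pd

# Toy stand-in for copy1_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "country": ["Pakistan", None, "USA", "UK"],
    "year": [2005.0, None, None, 2012.0],
})

# isna() marks missing cells; the column mean of that boolean mask is the missing share.
missing_share = toy_df.isna().mean().mul(100).round(1)
print(missing_share)
```

Columns with a large missing share deserve more caution when conclusions are drawn from them.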

**TASKS**
1. Which area of interest/expertise is in abundance in Pakistan and where we need more people?
2. How many professors we have in Data Sciences, Artificial Intelligence, or Machine Learning?
3. Which country and university hosted majority of our teachers?
4. Which research areas were most common in Pakistan?
5. How does Pakistan Student to PhD Professor Ratio compare against rest of the world, especially with USA, India and China?
6. Any visualization and patterns you can generate from this data

We will work through the tasks one by one in the following cells...

# Task 1 : Which area of interest/expertise is in abundance in Pakistan and where we need more people?

In [None]:
specializations = pd.DataFrame(copy1_df['specialization'])
# Number of null values in 'specialization' column
specializations['specialization'].isnull().sum()  

In [None]:
# Dropping the rows with null values
specializations.dropna(inplace=True) 

In [None]:
# converting into lower case
specializations['specialization'] = specializations.specialization.str.lower()

# removing all periods '.' (regex=False so that '.' is treated literally, not as "any character")
specializations['specialization'] = specializations.specialization.str.replace('.', '', regex=False)

# removing the standalone word 'and' (the word boundary keeps words such as 'broadband' intact)
specializations['specialization'] = specializations.specialization.str.replace(r'\band\s+', '', regex=True)

# Most records in this column list several areas of interest separated by commas ','
# Hence, we split each record on the commas
specializations['specialization'] = specializations.specialization.str.split(',')
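
Whether a pattern passed to `str.replace` is read as plain text or as a regular expression changes the result of the cleaning steps above. The sketch below uses two invented strings (`toy` is an assumption, not data from the file) to show a literal period removal followed by a word-boundary removal of "and".

```python
import pandas as pd

# Invented examples: one with periods and a standalone "and", one containing "and" inside a word.
toy = pd.Series(["a.i. and robotics", "broadband networks"])

# regex=False treats '.' as a literal period rather than "any character".
no_periods = toy.str.replace(".", "", regex=False)

# \b anchors the match at a word boundary, so "broadband" is left untouched.
no_and = no_periods.str.replace(r"\band\s+", "", regex=True)

print(no_and.tolist())
```

Passing `regex` explicitly also avoids relying on the default, which differs between pandas versions.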

In [None]:
# Now, build a flat list that holds each area of interest separately

area_Interest = []
for i in specializations['specialization']:
    for j in i:
        area_Interest.append(j.strip())

# Turn that flat list into a dataframe with one area of interest per row
df_area_Interest = pd.DataFrame(area_Interest)
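
The split-then-flatten step above can also be written without an explicit loop. This is only a sketch of an alternative, using `Series.explode` on an invented two-row series (`toy` is an assumption for illustration).

```python
import pandas as pd

# Invented records, each listing several areas separated by commas.
toy = pd.Series(["Software Engineering, Data Mining", "machine learning ,  Networks"])

areas = (
    toy.str.lower()
       .str.split(",")           # each record becomes a list of areas
       .explode()                # one row per list element
       .str.strip()              # remove surrounding whitespace
       .reset_index(drop=True)   # discard the repeated original row labels
)
print(areas.tolist())
```

The result plays the same role as `df_area_Interest`: one area of interest per row.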

In [None]:
# Name the single column, then count how often each area of interest occurs
df_area_Interest = df_area_Interest.rename(columns={0: 'area_of_interest'})
t1_counts = df_area_Interest['area_of_interest'].value_counts()

In [None]:
# rename_axis() labels the index and reset_index(name=...) turns it into a regular column,
# which gives the columns 'area_of_interest' and 'count' regardless of the pandas version
t1_frame = t1_counts.rename_axis('area_of_interest').reset_index(name='count')

In [None]:
# Now, we are going to plot top 10 areas of interests

plt.figure(figsize=(15, 8))
sns.barplot(x=t1_frame.loc[0:9, 'area_of_interest'], y=t1_frame.loc[0:9, 'count'])
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
plt.xlabel("Area of Interests", fontsize=18)
plt.ylabel("Count", fontsize=18)
plt.title("Top 10 Area of Interests", fontsize=20)
plt.show()

### **The chart shows that Software Engineering is the most common area of interest among Pakistani intellectuals.**

In [None]:
# Areas of interest listed by only one intellectual are treated here as areas that need more people.
t1_frame['need_more'] = t1_frame['count'] <= 1

# Print just five of them; there are many such areas of interest
for area in t1_frame.loc[t1_frame['need_more'], 'area_of_interest'].head(5):
    print("->", area)


# Task 2 : How many professors we have in Data Sciences, Artificial Intelligence, or Machine Learning?

In [None]:
# Count the entries that exactly match each of the three areas
area_counts = df_area_Interest['area_of_interest'].value_counts()
nDs = area_counts.get("data science", 0)
nAi = area_counts.get("artificial intelligence", 0)
nMl = area_counts.get("machine learning", 0)

print("Number of Professors in Data Sciences: ", nDs)
print("Number of Professors in Artificial Intelligence: ", nAi)
print("Number of Professors in Machine Learning: ", nMl)
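
The counts above use exact string matches, so an entry such as "machine learning applications" is not counted. The sketch below, on invented entries (`toy_areas` is an assumption), contrasts exact matching with substring matching; which one is appropriate depends on how broadly each area should be defined.

```python
import pandas as pd

# Invented entries for illustration only.
toy_areas = pd.Series([
    "machine learning",
    "machine learning applications",
    "data science",
    "applied artificial intelligence",
])

# Exact match: only entries that are precisely "machine learning".
exact_ml = (toy_areas == "machine learning").sum()

# Substring match: any entry that mentions "machine learning".
broad_ml = toy_areas.str.contains("machine learning", regex=False).sum()

print(exact_ml, broad_ml)
```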

# Task 3 : Which country and university hosted majority of our teachers?

In [None]:
countries = pd.DataFrame(copy1_df['country'])
countries.head()

### The country column contains **NULL** values, so we are going to drop them.

In [None]:
# Dropping null values
countries.dropna(inplace=True)

In [None]:
# Checking the NULL values count
countries.isnull().sum()

In [None]:
# Replacing "Macau" with "China" and "Urbana" with "USA"
countries['country'] = countries['country'].str.strip()
countries.loc[countries['country'] == "Urbana", "country"] = "USA"
countries.loc[countries['country'] == "Macau", "country"] = "China"
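
When more stray values turn up, a single mapping can replace the one-line-per-value assignments. This is a sketch under assumptions: `toy_countries` is invented, and `country_fixes` is a hypothetical mapping that holds only the two corrections mentioned above.

```python
import pandas as pd

# Invented values, including stray whitespace.
toy_countries = pd.Series([" USA", "Urbana", "Macau ", "Pakistan"])

# Hypothetical mapping table; extend it with whatever stray values the real column contains.
country_fixes = {"Urbana": "USA", "Macau": "China"}

# strip() first so the dictionary keys match, then replace whole values via the mapping.
cleaned = toy_countries.str.strip().replace(country_fixes)
print(cleaned.tolist())
```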

In [None]:
# Count how many faculty members graduated in each country
t3_counts = countries['country'].value_counts()

In [None]:
# rename_axis() labels the index and reset_index(name=...) turns it into a regular column,
# which gives the columns 'country' and 'count' regardless of the pandas version
t3_frame = t3_counts.rename_axis('country').reset_index(name='count')

In [None]:
# Now, we are going to plot top 10 countries which hosted majority of our teachers

plt.figure(figsize=(15, 8))
sns.barplot(x=t3_frame.loc[0:9, 'country'], y=t3_frame.loc[0:9, 'count'])
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
plt.xlabel("Country", fontsize=18)
plt.ylabel("Count", fontsize=18)
plt.title("Top 10 Countries which hosted majority of our teachers", fontsize=20)
plt.show()

### **The chart shows that most of our faculty members graduated in Pakistan.**

In [None]:
university = pd.DataFrame(copy1_df['graduated_from'])
university.head()

### The graduated_from column also contains **NULL** values, so we are going to drop them.

In [None]:
# Dropping null values from the university frame
university.dropna(inplace=True)

In [None]:
# Count how many faculty members graduated from each university
t32_counts = university['graduated_from'].value_counts()

In [None]:
# rename_axis() labels the index and reset_index(name=...) turns it into a regular column,
# which gives the columns 'university' and 'count' regardless of the pandas version
t32_frame = t32_counts.rename_axis('university').reset_index(name='count')
t32_frame

In [None]:
# Now, we are going to plot top 10 universities which hosted majority of our teachers

plt.figure(figsize=(15, 8))
sns.barplot(x=t32_frame.loc[0:9, 'university'], y=t32_frame.loc[0:9, 'count'])
plt.xticks(rotation=90, fontsize=12)
plt.yticks(fontsize=12)
plt.xlabel("University", fontsize=18)
plt.ylabel("Number of Teachers", fontsize=18)
plt.title("Top 10 Universities which hosted majority of our teachers", fontsize=20)
plt.show()

### **Therefore, FAST NUCES hosted most of our teachers.**

# Task 4 : Which research areas were most common in Pakistan?

In [None]:
plt.figure(figsize=(15, 8))
sns.barplot(x=t1_frame.loc[0:9, 'area_of_interest'], y=t1_frame.loc[0:9, 'count'])
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
plt.xlabel("Area of Interests", fontsize=18)
plt.ylabel("Count", fontsize=18)
plt.title("Top 10 Area of Research Interests", fontsize=20)
plt.show()

### **The above bar chart shows that Software Engineering was the most common research interest among Pakistani intellectuals.**

# Task 5 : How does Pakistan Student to PhD Professor Ratio compare against rest of the world, especially with USA, India and China?

In [None]:
phd = pd.DataFrame(copy1_df['degree'])

In [None]:
# Checking the NULL values count
phd.isnull().sum()

### We are going to drop the **NULL** values in the degree column.

In [None]:
# Dropping null values
phd.dropna(inplace=True)

In [None]:
# Converting all the "phd" written in different formats in a uniform "Phd" format
phd.loc[phd['degree']=='PhD',           'degree'] = 'Phd'
phd.loc[phd['degree']=='Ph.D(Scholar)', 'degree'] = 'Phd'
phd.loc[phd['degree']=='Ph.D (Scholar)','degree'] = 'Phd'
phd.loc[phd['degree']=='Ph.D',          'degree'] = 'Phd'
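
Listing every spelling by hand works for the variants shown, but it misses any variant that is not listed. As a sketch of a more general approach (the sample values in `toy_degrees` are assumptions), the strings can be reduced to letters only before comparing. Like the cell above, it counts "Ph.D (Scholar)" as a PhD.

```python
import pandas as pd

# Invented sample of degree spellings.
toy_degrees = pd.Series(["PhD", "Ph.D", "Ph.D (Scholar)", "MS", "Ph.D(Scholar)"])

# Keep letters only and lower-case them: "Ph.D (Scholar)" becomes "phdscholar".
compact = toy_degrees.str.replace(r"[^A-Za-z]", "", regex=True).str.lower()

# Any entry whose compact form starts with "phd" is relabelled as "Phd"; others are kept.
is_phd = compact.str.startswith("phd")
normalized = toy_degrees.mask(is_phd, "Phd")

print(normalized.tolist())
```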

In [None]:
phd.head()

In [None]:
phd.degree.value_counts().head()

### According to an HEC report, in **2014-2015** there were over **10,125** full-time Ph.D. faculty teaching in Pakistan across all disciplines. Computer Science and related disciplines are widely taught in Pakistan, with over 90 universities offering them with qualified faculty. According to our dataset, there are 485 PhD faculty members in Computer Science in Pakistan for 10,000 students. So, on average, there is one PhD faculty member for roughly every **20 students** in computer science programs.
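
The ratio quoted above follows directly from the two figures in the text. A small worked check, using only those numbers:

```python
# Both figures are taken from the paragraph above.
phd_faculty = 485    # PhD faculty members in Computer Science
students = 10_000    # students they are compared against

ratio = students / phd_faculty
print(f"{ratio:.1f} students per PhD faculty member")
```

The exact quotient is a little above 20, which is why the text rounds it to about 20 students per PhD faculty member.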

# Task 6 : Any visualization and patterns you can generate from this data

## Percentage of Faculty available in provinces across Pakistan

In [None]:
province = copy1_df['province'].value_counts()
plt.rcParams['font.size'] = 16
province.plot(kind='pie', figsize=(8,8), autopct='%1.0f%%')
plt.title("Province Wise Faculty Percentage")
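
The percentages drawn on the pie chart can also be obtained as numbers. A minimal sketch, assuming an invented province column (`toy_province` is not the real data):

```python
import pandas as pd

# Invented province labels for illustration only.
toy_province = pd.Series(["Punjab", "Punjab", "Sindh", "KPK"])

# normalize=True returns fractions instead of raw counts; scale them to percentages.
share = toy_province.value_counts(normalize=True).mul(100).round(1)
print(share)
```

Applied to `copy1_df['province']`, the same expression would give the values behind the chart labels.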

### **Punjab** leads this category with **45%** of the faculty, followed by **Sindh** with **21%**.

## Top 10 universities with highest number of Faculty members

In [None]:
copy1_df['current_university'].value_counts()[:10].plot(kind="bar")

### Hence, **COMSATS Islamabad** has the largest number of faculty members.

## To conclude, in this Notebook:
### * I tried my best to complete the given tasks as a beginner.
### * I placed a lot of comments and descriptions so that beginners like me can follow along.
### * I would also like to thank and credit other contributors for sharing their work, which helped me move forward.
### * **Upvote** the Notebook if you find it useful.
 
## **Thank You**