feed.xml

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>A place to learn and share</title>
    <description>Personal page of data science amator</description>
    <link>http://ferodia.github.io/</link>
    <atom:link href="http://ferodia.github.io/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Sun, 24 Jul 2016 16:22:34 +0200</pubDate>
    <lastBuildDate>Sun, 24 Jul 2016 16:22:34 +0200</lastBuildDate>
    <generator>Jekyll v3.1.1</generator>
    
      <item>
        <title>Linkedin Data scraping with BeautifulSoup</title>
        <description>&lt;p&gt;Today I would like to do some web scraping of Linkedin job postings, I have two
ways to go:
 - Source code extraction
 - Using the Linkedin API&lt;/p&gt;

&lt;p&gt;I chose the first option, mainly because the API is poorly documented and I
wanted to experiment with BeautifulSoup.
BeautifulSoup in few words is a library that parses HTML pages and makes it easy
to extract the data.&lt;/p&gt;

&lt;p&gt;Official page: &lt;a href=&quot;https://www.crummy.com/software/BeautifulSoup/&quot;&gt;BeautifulSoup web page&lt;/a&gt;&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;c&quot;&gt;## Main packages needed are ulrlib2 to make url queries and beautifulSoup to structure the results&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;## the imports needed for this experiment&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;bs4&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BeautifulSoup&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;urllib2&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pandas&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pd&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# get source code of the page&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;get_url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;urllib2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;urlopen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# makes the source  tree format like &lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;beautify&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;source&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BeautifulSoup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;source&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;html.parser&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now that the functions are defined and libraries are imported, I’ll get job
postings of linkedin.&lt;br /&gt;
The inspection of the source code of the page shows indications where to access
elements we are interested in.&lt;br /&gt;
I basically achieved that by ‘inspecting elements’ using the browser.&lt;br /&gt;
I will look for “Data scientist” postings. Note that I’ll keep the quotes in my
search because otherwise I’ll get unrelevant postings containing the words
“Data” and “Scientist”.&lt;br /&gt;
Below we are only interested to find div element with class ‘results-context’,
which contains summary of the search, especially the number of items found.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;jobs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;beautify&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;https://www.linkedin.com/jobs/search?keywords=%22Data+Scientist%22&amp;amp;&amp;#39;&lt;/span&gt;
                &lt;span class=&quot;s&quot;&gt;&amp;#39;location=France&amp;amp;trk=jobs_jserp_search_button_execute&amp;amp;orig=JSERP&amp;amp;locationId=fr%3A0&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;results_context&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jobs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;find&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;div&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;class&amp;#39;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;results-context&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;find&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;strong&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;n_jobs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;results_context&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;replace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;,&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; 
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;###### Number of job postings #######&amp;quot;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n_jobs&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;#####################################&amp;quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;    ###### Number of job postings #######
    93
    #####################################&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now let’s check the number of postings we got on one page&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;results&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jobs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;find_all&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;li&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;class&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;job-listing&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;n_postings&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;results&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;#### Number of job postings per page ####&amp;quot;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n_postings&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;#########################################&amp;quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;    #### Number of job postings per page ####
    25
    #########################################&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;#### Number of pages ####&amp;quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;n_pages&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;round&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n_jobs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n_postings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n_pages&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;#########################&amp;quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;    #### Number of pages ####
    4
    #########################&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;To be able to extract all postings, I need to iterate over the pages, therefore
I will proceed with examining the urls of the different pages to work out the
logic.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;url of the first page&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;https://www.linkedin.com/jobs/search?keywords=Data+Scientist&amp;amp;locationId=fr:0&amp;amp;s
tart=0&amp;amp;count=25&amp;amp;trk=jobs_jserp_pagination_1&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;second page&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;https://www.linkedin.com/jobs/search?keywords=Data+Scientist&amp;amp;locationId=fr:0&amp;amp;s
tart=25&amp;amp;count=25&amp;amp;trk=jobs_jserp_pagination_2&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;third page&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;https://www.linkedin.com/jobs/search?keywords=Data+Scientist&amp;amp;locationId=fr:0&amp;amp;s
tart=50&amp;amp;count=25&amp;amp;trk=jobs_jserp_pagination_3&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;there are two elements changing :&lt;br /&gt;
- start=25 which is a product of page number and 25&lt;br /&gt;
- trk=jobs_jserp_pagination_3&lt;/p&gt;

&lt;p&gt;I also noticed that the pagination number doesn’t have to be changed to go to
next page, which means I can change only start value to get the next postings
(may be Linkedin developers should do something about it …)&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;titles&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;companies&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;locations&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;links&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#loop over all pages to get the posting details&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n_pages&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;    
    &lt;span class=&quot;c&quot;&gt;# define the base url for generic searching &lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;http://www.linkedin.com/jobs/search?keywords=%22Data+Scientist%22&amp;amp;locationId=&amp;quot;&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&amp;quot;fr:0&amp;amp;start=nPostings&amp;amp;count=25&amp;amp;trk=jobs_jserp_pagination_1&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;replace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;nPostings&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;25&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;soup&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;beautify&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;c&quot;&gt;# Build lists for each type of information&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;results&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;soup&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;find_all&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;li&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;class&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;job-listing&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;results&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sort&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;c&quot;&gt;# print &amp;quot;there are &amp;quot;, len(results) , &amp;quot; results&amp;quot;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;res&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;results&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;c&quot;&gt;# set only the value if get_text() &lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;titles&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;res&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;h2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;span&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;res&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;h2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;span&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;None&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;companies&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;res&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;find&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;span&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;class&amp;#39;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;company-name-text&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;
                         &lt;span class=&quot;n&quot;&gt;res&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;find&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;span&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;class&amp;#39;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;company-name-text&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;None&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;locations&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;res&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;find&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;span&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;class&amp;#39;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;job-location&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;
                        &lt;span class=&quot;n&quot;&gt;res&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;find&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;span&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;class&amp;#39;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;job-location&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;None&amp;#39;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;links&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;res&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;find&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;a&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;class&amp;#39;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;job-title-link&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;href&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;As I mentioned above, all the information about where to find the job details
are made easy thanks to source code viewing via any browser&lt;/p&gt;

&lt;p&gt;Next, it’s time to create the data frame&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;jobs_linkedin&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DataFrame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;({&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;title&amp;#39;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;titles&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;company&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;companies&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;location&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;locations&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;link&amp;#39;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;links&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now the table is filled with the above columns. &lt;br /&gt;
Just to verify, I can check the size of the table to make sure I got all the
postings&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;jobs_linkedin&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;    company     93
    link        93
    location    93
    title       93
    dtype: int64&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;In the end, I got an actual dataset just by scraping web pages. Gathering data
never have been as easy.
I can even go further by parsing the description of each posting page and
extract information like:&lt;br /&gt;
- Level&lt;br /&gt;
- Description&lt;br /&gt;
- Technologies&lt;br /&gt;
…&lt;/p&gt;

&lt;p&gt;There are no limits to which extent we can exploit the information in HTML pages
thanks to BeautifulSoup, you just have to read the documentation which is very
good by the way, and get to practice on real pages.&lt;/p&gt;

&lt;p&gt;Ciao!&lt;/p&gt;
</description>
        <pubDate>Sat, 28 May 2016 12:57:54 +0200</pubDate>
        <link>http://ferodia.github.io/linkedin-data-scraping-with-beautifulsoup</link>
        <guid isPermaLink="true">http://ferodia.github.io/linkedin-data-scraping-with-beautifulsoup</guid>
        
        
      </item>
    
      <item>
        <title>Data analysis made simple</title>
        <description>&lt;p&gt;While roaming around looking for data to explore I came across this dataset in Kaggle website.
The data set contains information about the animals admitted in the shelter and the purpose is to predict their outcome.&lt;/p&gt;

&lt;p&gt;But before we get to that let’s explore the files and get to know the features.&lt;/p&gt;

&lt;h2 id=&quot;warm-up&quot;&gt;Warm up&lt;/h2&gt;

&lt;p&gt;To read the csv files we need the library readr&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;readr&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;em&gt;If you don’t have the library available you need to install it&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now let’s read the files and have a look at what we have.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;animals &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; read_csv&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;file&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;train.csv&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kp&quot;&gt;head&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;animals&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##   AnimalID    Name            DateTime     OutcomeType OutcomeSubtype
## 1  A671945 Hambone 2014-02-12 18:22:00 Return_to_owner           &amp;lt;NA&amp;gt;
## 2  A656520   Emily 2013-10-13 12:44:00      Euthanasia      Suffering
## 3  A686464  Pearce 2015-01-31 12:28:00        Adoption         Foster
## 4  A683430    &amp;lt;NA&amp;gt; 2014-07-11 19:09:00        Transfer        Partner
## 5  A667013    &amp;lt;NA&amp;gt; 2013-11-15 12:52:00        Transfer        Partner
## 6  A677334    Elsa 2014-04-25 13:04:00        Transfer        Partner
##   AnimalType SexuponOutcome AgeuponOutcome
## 1        Dog  Neutered Male         1 year
## 2        Cat  Spayed Female         1 year
## 3        Dog  Neutered Male        2 years
## 4        Cat    Intact Male        3 weeks
## 5        Dog  Neutered Male        2 years
## 6        Dog  Intact Female        1 month
##                               Breed       Color
## 1             Shetland Sheepdog Mix Brown/White
## 2            Domestic Shorthair Mix Cream Tabby
## 3                      Pit Bull Mix  Blue/White
## 4            Domestic Shorthair Mix  Blue Cream
## 5       Lhasa Apso/Miniature Poodle         Tan
## 6 Cairn Terrier/Chihuahua Shorthair   Black/Tan&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;We can also check the dimension of the dataset&lt;br /&gt;
26729, 10
Some processing to convert some columns to factors, since we have many of them we’ll use the magic lapply.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;factors &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;OutcomeType&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;OutcomeSubtype&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;AnimalType&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;AgeuponOutcome&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;SexuponOutcome&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;Breed&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;Color&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
animals&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;factors&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;lapply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;animals&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;factors&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;FUN &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;as.factor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h2 id=&quot;know-your-data&quot;&gt;Know your data&lt;/h2&gt;

&lt;p&gt;I will proceed with some explorations to get to know the kind of information the dataset possesses. &lt;br /&gt;
Summary is a very useful to check basic information about the data frame.
It also shows that we hve some NA, “Other”, “Unknown” values which might be a problem to get relevant statistical results and machine learning models.&lt;/p&gt;

&lt;p&gt;That’s why I will start by some data observation and processign when needed to have the table more harmonised.&lt;/p&gt;

&lt;h3 id=&quot;age&quot;&gt;Age&lt;/h3&gt;

&lt;p&gt;The first observation is tht the age is expressed in various “units”:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;levels&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;animals&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;AgeuponOutcome&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##  [1] &amp;quot;0 years&amp;quot;   &amp;quot;10 months&amp;quot; &amp;quot;10 years&amp;quot;  &amp;quot;11 months&amp;quot; &amp;quot;11 years&amp;quot; 
##  [6] &amp;quot;12 years&amp;quot;  &amp;quot;13 years&amp;quot;  &amp;quot;14 years&amp;quot;  &amp;quot;15 years&amp;quot;  &amp;quot;16 years&amp;quot; 
## [11] &amp;quot;17 years&amp;quot;  &amp;quot;18 years&amp;quot;  &amp;quot;19 years&amp;quot;  &amp;quot;1 day&amp;quot;     &amp;quot;1 month&amp;quot;  
## [16] &amp;quot;1 week&amp;quot;    &amp;quot;1 weeks&amp;quot;   &amp;quot;1 year&amp;quot;    &amp;quot;20 years&amp;quot;  &amp;quot;2 days&amp;quot;   
## [21] &amp;quot;2 months&amp;quot;  &amp;quot;2 weeks&amp;quot;   &amp;quot;2 years&amp;quot;   &amp;quot;3 days&amp;quot;    &amp;quot;3 months&amp;quot; 
## [26] &amp;quot;3 weeks&amp;quot;   &amp;quot;3 years&amp;quot;   &amp;quot;4 days&amp;quot;    &amp;quot;4 months&amp;quot;  &amp;quot;4 weeks&amp;quot;  
## [31] &amp;quot;4 years&amp;quot;   &amp;quot;5 days&amp;quot;    &amp;quot;5 months&amp;quot;  &amp;quot;5 weeks&amp;quot;   &amp;quot;5 years&amp;quot;  
## [36] &amp;quot;6 days&amp;quot;    &amp;quot;6 months&amp;quot;  &amp;quot;6 years&amp;quot;   &amp;quot;7 months&amp;quot;  &amp;quot;7 years&amp;quot;  
## [41] &amp;quot;8 months&amp;quot;  &amp;quot;8 years&amp;quot;   &amp;quot;9 months&amp;quot;  &amp;quot;9 years&amp;quot;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;I counted 44 units, in order to be able to use this information it should be expressed with the same unit, I chose the smallest unit existing which is “day”.
for each row I will apply a transformation by converting “week”,”month”,”year”,and “day”, 
First the function is defined&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kn&quot;&gt;library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;stringr&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
convertAge &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;age&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# first extract the digits&lt;/span&gt;
  regexp &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;[[:digit:]]+&amp;quot;&lt;/span&gt;
  result &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;
  digits &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;strtoi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;str_extract&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;age&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; regexp&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
  &lt;span class=&quot;kr&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;grepl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;day&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;age&lt;span class=&quot;p&quot;&gt;)){&lt;/span&gt;
    result &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; digits
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;grepl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;week&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;age&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    result &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; digits&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;7&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;grepl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;month&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;age&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    result &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; digits&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;30&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;grepl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;year&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;age&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    result &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; digits&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;365&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;kr&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;result&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;then I apply the conversion function to each row of the column age&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt; animals&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;AgeuponOutcome &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;sapply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;animals&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;AgeuponOutcome&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;convertAge&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h3 id=&quot;dogs-vs-cats&quot;&gt;Dogs vs Cats&lt;/h3&gt;

&lt;p&gt;Here we will explore the correlation between the fate of the animal and its type.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/ferodia.github.io/images/markdown_files/figure-markdown_github/unnamed-chunk-7-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;From the plot, it seems that the animal type impacts somehow the outcome.
To make sure this sample is not skewed by dominance of one type over another let’s check first the distribution&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## 
##       Cat       Dog 
## 0.4165513 0.5834487&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The distribution is not perfectly balanced because Dogs represent 58% of the animals.
The outcome depends on the animal type&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##      
##        Adoption      Died Euthanasia Return_to_owner  Transfer
##   Cat 0.3966942 0.7461929  0.4565916       0.1044714 0.5842709
##   Dog 0.6033058 0.2538071  0.5434084       0.8955286 0.4157291&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h3 id=&quot;male-vs-female&quot;&gt;Male vs Female&lt;/h3&gt;

&lt;p&gt;In this part we are more interested in the gender, which in this case seems to be divided in 4 types:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## [1] &amp;quot;Intact Female&amp;quot; &amp;quot;Intact Male&amp;quot;   &amp;quot;Neutered Male&amp;quot; &amp;quot;Spayed Female&amp;quot;
## [5] &amp;quot;Unknown&amp;quot;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;And “Unknown”, that can be any of the other 4.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/ferodia.github.io/images/markdown_files/figure-markdown_github/unnamed-chunk-11-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;##                
##                   Adoption       Died Euthanasia Return_to_owner  Transfer
##   Intact Female 0.01885040 0.28426396 0.25787781     0.062904911 0.2706432
##   Intact Male   0.01467174 0.40101523 0.30675241     0.099686520 0.2477181
##   Neutered Male 0.48491039 0.09644670 0.22122186     0.469592476 0.2066440
##   Spayed Female 0.48156746 0.09137056 0.14919614     0.365308255 0.1736362
##   Unknown       0.00000000 0.12690355 0.06495177     0.002507837 0.1013585&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The challenge is to fill the unknown with the right values : female (spayed, or intact), male (neutered, or intact), we will use basic knowledge as well as the other features.&lt;/p&gt;

&lt;p&gt;The first thing to try is to infer the gender based on color using the following golden rule:&lt;br /&gt;
*For genetical reasons, only females are calico, which means they have three colors (white, orange and black), they can happen to be male, but this means they have a genetic anomaly (XXY chromosomes), but I won’t go that far.&lt;br /&gt;
Some numbers to prove my point:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;animals&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;SexuponOutcome&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;animals&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;Color&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Calico&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## 
## Intact Female   Intact Male Neutered Male Spayed Female       Unknown 
##           198             3             1           296            19&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;when I display the count of cats that are “Calico” per gender, out of more than 400 cats, only 4 of them are male. Therefore the assumption that the 19 unknown are female is not harm the statistics.
Now the main problem remains what kind of female ? Intact or spayed ?&lt;/p&gt;

&lt;p&gt;At first, I can derive some intuition: the animals are born intact, and are spayed/neutered at some point in their life which should not happen before some age, for example a cat who is 1 week is too young to be spayed and vice versa, an old cat is more likely to have been spayed already, let’s plot age as a function of gender to verify this theory.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/ferodia.github.io/images/markdown_files/figure-markdown_github/unnamed-chunk-13-1.png&quot; alt=&quot;center&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Until the age of 30 days, the neutered/spayed animals are inexistent, which makes sense from a scientific point o view because the animals are too young. The threshold I will use is 30.&lt;br /&gt;
The following conclusion is drawn
&lt;strong&gt;The Calico cats under the age of 30 days are all intact females&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s apply it&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;animals&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;SexuponOutcome&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;animals&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;AgeuponOutcome &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;30&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; animals&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;SexuponOutcome &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Unknown&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Intact Female&amp;quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The other part of data with unknown gender could be useful to predict the outcome of the animal&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;&lt;span class=&quot;kp&quot;&gt;table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;animals&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;OutcomeType&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;animals&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;SexuponOutcome&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Unknown&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;## 
##        Adoption            Died      Euthanasia Return_to_owner 
##               0               8              57              10 
##        Transfer 
##             322&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;For example, animal of unknown type will never be adopted.&lt;/p&gt;

&lt;h3 id=&quot;feature-engineering&quot;&gt;Feature engineering&lt;/h3&gt;

&lt;p&gt;Now let’s move on to create new features.&lt;/p&gt;

&lt;h2 id=&quot;hasname&quot;&gt;HasName&lt;/h2&gt;

&lt;p&gt;To simply the processing of name, the characters themseves are not useful in the context of learning and outcome prediction. However the presence is important. It means most of the time that the animal belonged to somebody how it a certain name.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-r&quot; data-lang=&quot;r&quot;&gt;animals&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;hasName &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;sapply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;animals&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;Name&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;FUN &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;is.na&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;x&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
animals&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;hasName &lt;span class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;factor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;animals&lt;span class=&quot;o&quot;&gt;$&lt;/span&gt;hasName&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;That’s all for today. See you next time!&lt;/p&gt;

&lt;p&gt;Ciao!&lt;/p&gt;

</description>
        <pubDate>Sun, 01 May 2016 19:46:52 +0200</pubDate>
        <link>http://ferodia.github.io/blog/2016/Data-Analysis-Animals/</link>
        <guid isPermaLink="true">http://ferodia.github.io/blog/2016/Data-Analysis-Animals/</guid>
        
        
        <category>jekyll</category>
        
        <category>update</category>
        
      </item>
    
  </channel>
</rss>